Another real-time search and indexing system built on Apache Lucene.

From their website:

Zoie is a mature open source project and has been deployed in a real-time large-scale consumer website: handling millions of searches as well as hundreds of thousands of updates daily.

All Zoie releases have gone through extensive functional and performance testing by LinkedIn before made public. All major versions are released after a trial period on the production environment.

In a real-time search/indexing system, a document is made available as soon as it is added to the index. This functionality is especially important to time-sensitive information such as news, job openings, tweets etc.

Clustering Lucene

Besides using Apache Solr or Katta, this article describes several ways to cluster a Lucene index:

  1. Use a shared file system between all nodes, and use FSDirectory.
  2. Use indexes on each node's local file system and a synchronization strategy.
  3. Use a database through JDBCDirectory.
  4. Use a distributed file system (e.g. Google File System, Nutch Distributed File System).
  5. Use a local cache with a backup in the database.
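
As a rough illustration of option 2, the following is a minimal sketch of one possible synchronization strategy: a node periodically mirrors the index files from a master directory into its local directory. It is not taken from the article and uses only the Java standard library. The class name IndexSync and the method sync are hypothetical, and the sketch assumes that index files are never modified in place once written, so a file with the same name and size can be skipped. It also ignores coordination with a writer that is in the middle of a commit and with searchers that still have the local index open.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: mirrors a master index
// directory into a node-local directory.
final class IndexSync {

    /**
     * Copies new or changed files from masterDir to localDir and deletes
     * local files that no longer exist on the master.
     * Returns the number of files copied.
     */
    static int sync(Path masterDir, Path localDir) throws IOException {
        Files.createDirectories(localDir);
        Set<String> masterFiles = new HashSet<>();
        int copied = 0;

        try (DirectoryStream<Path> stream = Files.newDirectoryStream(masterDir)) {
            for (Path source : stream) {
                if (!Files.isRegularFile(source)) {
                    continue; // this sketch only handles a flat directory
                }
                String name = source.getFileName().toString();
                masterFiles.add(name);
                Path target = localDir.resolve(name);

                // Assumption: same name and same size means the file is unchanged.
                if (Files.exists(target) && Files.size(target) == Files.size(source)) {
                    continue;
                }
                Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                copied++;
            }
        }

        // Collect stale local files first, then delete them.
        List<Path> stale = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(localDir)) {
            for (Path local : stream) {
                String name = local.getFileName().toString();
                if (Files.isRegularFile(local) && !masterFiles.contains(name)) {
                    stale.add(local);
                }
            }
        }
        for (Path path : stale) {
            Files.delete(path);
        }
        return copied;
    }
}
```

A node could call IndexSync.sync(masterDir, localDir) on a timer and reopen its searcher whenever the call returns a value greater than zero. A production setup would also need to copy from a consistent snapshot of the master index.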

Some other ways to distribute the index are discussed here. A document written at HP describes a parallel, distributed free-text index called Distributed Lucene. This document from IBM gives some insight into scaling out versus scaling up using Nutch and Lucene.

A novel approach is to use Terracotta and Compass to cluster the index, as described here.