lucene-java-user mailing list archives

From Mark Bakker <>
Subject Search through versioned data
Date Wed, 09 Dec 2015 08:17:32 GMT

We need to search our data in both a faceted and a 'normal' way, and currently use Lucene
for this. Our data is sharded into equally sized 'blocks' (between 256MB and 4GB each), and
our Lucene indexes follow these shards, so each index always covers 256MB to 4GB of data.
When a block outgrows the maximum it is split (and its index as well).

This setup currently works: as new blocks of data arrive, the indexes fill up until the
maximum size is reached and are then split along with their blocks.

But now we have an extra requirement: we need to search in a versioned way (our data is
versioned). For us, a version is a change in the total dataset, with transaction timestamps
at microsecond precision.

We see a few possibilities:

1. Use Lucene commit points and keep them forever with NoDeletionPolicy. We doubt this can
scale to millions of different commit points: if I read the documentation correctly, each
commit adds an extra segments file, and that will not really scale.
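To make option 1 concrete, here is a minimal plain-Java sketch of the snapshot-to-commit lookup we have in mind. The class, method, and the "ts" user-data key are made up for illustration; the actual Lucene wiring (NoDeletionPolicy.INSTANCE on the IndexWriterConfig, tagging commits via IndexWriter.setCommitData in Lucene 5.x, then DirectoryReader.listCommits and DirectoryReader.open(IndexCommit)) is only referenced in the comments.

```java
import java.util.Arrays;

// Option 1 sketch: configure the writer with
//   config.setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE);
// and tag every commit's user data with its transaction timestamp
// (microseconds) before writer.commit(). At query time, read the
// timestamps from DirectoryReader.listCommits(dir), pick the newest
// commit at or before the requested snapshot, and open it with
// DirectoryReader.open(commit) for a point-in-time view.
public final class CommitPicker {

    /**
     * Given commit timestamps (microseconds, sorted ascending, one per
     * commit point), returns the index of the newest commit whose
     * timestamp is <= snapshotMicros, or -1 if the snapshot predates
     * all commits.
     */
    public static int pickCommit(long[] commitTimestamps, long snapshotMicros) {
        int pos = Arrays.binarySearch(commitTimestamps, snapshotMicros);
        if (pos >= 0) {
            return pos;            // exact match on a commit timestamp
        }
        int insertion = -pos - 1;  // first commit strictly after the snapshot
        return insertion - 1;      // the commit just before it (may be -1)
    }

    public static void main(String[] args) {
        long[] commits = {1_000L, 2_000L, 3_000L};
        System.out.println(pickCommit(commits, 2_500L)); // prints 1
    }
}
```

Our worry stays the same: this selection step is cheap, but keeping every commit means an ever-growing list of segments files, which is why we doubt it scales to millions of versions.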

2. Save an extra version of each document on every update, adding extra fields to the index
and the facet index: 'from' and 'till' (long) timestamps for normal search, plus extra
facets for both the 'from' and the 'till' timestamp (each date split into 5 facets with
300-1000 unique values each) to speed up the faceted search.
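To make option 2 concrete, here is a small plain-Java model of the validity check we would be encoding; all names are illustrative. Each update would write a new Lucene document carrying its validity interval [from, till), and a snapshot query at time t would add two numeric range clauses (from <= t AND till > t), e.g. NumericRangeQuery.newLongRange in Lucene 5.x or LongPoint.newRangeQuery in later versions, combined with BooleanClause.Occur.FILTER.

```java
import java.util.ArrayList;
import java.util.List;

// Option 2 sketch: model the per-version [from, till) interval and the
// check a snapshot query must express. In the real index this check is
// not done in Java but as two numeric range clauses on the 'from' and
// 'till' fields, ANDed onto the user's query.
public final class VersionedMatch {

    /** One stored version of a logical document, valid in [from, till). */
    public static final class Version {
        final String id;
        final long fromMicros;
        final long tillMicros;  // Long.MAX_VALUE for the current version

        Version(String id, long fromMicros, long tillMicros) {
            this.id = id;
            this.fromMicros = fromMicros;
            this.tillMicros = tillMicros;
        }
    }

    /** Returns the versions visible in the snapshot at time t (from <= t < till). */
    public static List<Version> visibleAt(List<Version> versions, long t) {
        List<Version> out = new ArrayList<>();
        for (Version v : versions) {
            if (v.fromMicros <= t && t < v.tillMicros) {
                out.add(v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Version> history = new ArrayList<>();
        history.add(new Version("doc1@v1", 0L, 100L));
        history.add(new Version("doc1@v2", 100L, Long.MAX_VALUE));
        System.out.println(visibleAt(history, 50L).get(0).id); // prints doc1@v1
    }
}
```

Closing the previous version's 'till' on every update means each update touches two documents (delete-and-reindex the old version with its new 'till', add the new version), which is part of what worries us about index churn here.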

I hope there is another possible solution that I don't know of.

Our requirements:

-Search through 4GB of data on documents with 5-100 fields and 3-15 facets each.

-Have a response time < 100ms.

-Be able to do 20 queries per second

-Be able to search through each 'snapshot', where a snapshot is defined as a change in the
total dataset. Snapshots have microsecond time precision.

The questions I have:

1. What would be the most appropriate way to implement such a search: option 1, option 2,
or another solution?

2. In the case of option 1, is it a solution worth looking into at all?

3. In the case of option 2, will Lucene handle documents that look very similar (different
versions of the same document) efficiently?

4. In the case of option 2, will it meet our requirements?

Kind regards,

Mark Bakker
