lucene-java-user mailing list archives

From Conny Gyllendahl <conny.gyllend...@gmail.com>
Subject Search returning the correct number of hits but wrong stored data
Date Mon, 30 May 2016 12:32:06 GMT
I have been struggling for some time to fix an issue in the application I
am developing, and since I am starting to run out of ideas I figured I'd
try reaching out for help.

I store a couple of million records in Ehcache and wish to use Lucene to
quickly find the keys of the elements I need.

My cache key has the following fields:

subscriberId - String, LongField (Lucene 5.5) or LongPoint + StoredField
 (Lucene 6.0)
date - Integer, IntField (Lucene 5.5) or IntPoint + StoredField (Lucene 6.0)
hour - Integer, StoredField
networkId - Long, StoredField
sessionId - Long, StoredField

(for example the date is converted to an Integer like 20160530 and then
stored)
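To make the date encoding concrete, here is a minimal sketch of the
conversion, assuming plain yyyyMMdd arithmetic. The class and method names
(DateKeys, toIntDate, fromIntDate) are made up for illustration and are not
taken from the application.

```java
import java.time.LocalDate;

final class DateKeys {
    // Encode a date as an int of the form yyyyMMdd, e.g. 2016-05-30 -> 20160530.
    // Encoded values sort in the same order as the dates themselves,
    // which is what makes numeric range queries on them meaningful.
    static int toIntDate(LocalDate date) {
        return date.getYear() * 10000 + date.getMonthValue() * 100 + date.getDayOfMonth();
    }

    // Decode a yyyyMMdd int back into a LocalDate.
    static LocalDate fromIntDate(int yyyymmdd) {
        return LocalDate.of(yyyymmdd / 10000, (yyyymmdd / 100) % 100, yyyymmdd % 100);
    }

    public static void main(String[] args) {
        int encoded = toIntDate(LocalDate.of(2016, 5, 30));
        System.out.println(encoded);              // prints 20160530
        System.out.println(fromIntDate(encoded)); // prints 2016-05-30
    }
}
```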

The above allows me to do quick range queries like: +subscriberId:[12345
TO 12345] +date:[20160501 TO 20160531]

I have written my own Collector that extends SimpleCollector and just adds
the document ids to a Set<Integer>.
After the search I loop through the set, call IndexSearcher.doc(id) to get
the document, create my cache key object from the fields and get the
element from the cache using the key.
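The key reconstruction step can be sketched roughly as below. This is a
simplified stand-in rather than the real code: a Map<String, String> plays
the role of the stored fields of a Lucene Document, a HashMap plays the
role of Ehcache, and the names CacheKey and fromStoredFields are
hypothetical. The point is only that a key rebuilt from stored values must
be equal to the key originally used in the cache.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class CacheKey {
    final String subscriberId;
    final int date;       // yyyyMMdd, e.g. 20160530
    final int hour;
    final long networkId;
    final long sessionId;

    CacheKey(String subscriberId, int date, int hour, long networkId, long sessionId) {
        this.subscriberId = subscriberId;
        this.date = date;
        this.hour = hour;
        this.networkId = networkId;
        this.sessionId = sessionId;
    }

    // Rebuild a key from stored field values (field name -> string value).
    // The map is a stand-in for the stored fields of a search hit.
    static CacheKey fromStoredFields(Map<String, String> fields) {
        return new CacheKey(
                fields.get("subscriberId"),
                Integer.parseInt(fields.get("date")),
                Integer.parseInt(fields.get("hour")),
                Long.parseLong(fields.get("networkId")),
                Long.parseLong(fields.get("sessionId")));
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CacheKey)) return false;
        CacheKey k = (CacheKey) o;
        return date == k.date && hour == k.hour
                && networkId == k.networkId && sessionId == k.sessionId
                && Objects.equals(subscriberId, k.subscriberId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(subscriberId, date, hour, networkId, sessionId);
    }
}

final class CacheLookupDemo {
    public static void main(String[] args) {
        // Stand-in for the cache: key -> element.
        Map<CacheKey, String> cache = new HashMap<>();
        cache.put(new CacheKey("12345", 20160530, 12, 7L, 99L), "element-A");

        // Stand-in for the stored fields of one hit.
        Map<String, String> stored = new HashMap<>();
        stored.put("subscriberId", "12345");
        stored.put("date", "20160530");
        stored.put("hour", "12");
        stored.put("networkId", "7");
        stored.put("sessionId", "99");

        System.out.println(cache.get(CacheKey.fromStoredFields(stored))); // prints element-A
    }
}
```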

I have an Ehcache CacheEventListener which:
- when an element is added to the cache: add a Document to Lucene with the
fields from the key
- when an element is removed from the cache: remove the Document from
Lucene with the Term from the key

When the application starts it reads all entries from the database in a
serial fashion and everything is fine.

However, the application then launches several threads which consume
messages from a message queue and add them to the cache, which in turn
adds them to Lucene through the listener. We get a burst of 2000-3000
messages every 5 minutes.

And this is where I run into problems: a search will return the correct
number of hits (verified against the database), but some (not all) of the
documents are not the correct ones (they contain values for another
subscriber/date/etc).

At startup I create the Directory and IndexWriter in a synchronized block
so all threads/instances use a single shared IndexWriter.

I have tried three ways of reading/searching:
- DirectoryReader.open(IndexWriter)
- DirectoryReader.open(<Directory created at startup and used to create
IndexWriter>)
- DirectoryReader.open(new
Directory(FSDirectory.open(Paths.get(indexDirectory))))
I have also tried with and without calling IndexWriter.commit() after each
addDocument and deleteDocument call.

I must get all documents when I do the search, but getting deleted
documents is not an issue.
I create/close a new IndexReader for each search request.

Clearly things work as long as everything runs in a serial fashion, but
once the application starts consuming messages from the queue it runs into
problems. Once the problem appears, it manifests itself even if there are
no more writes to the index (i.e. we stop it from consuming new messages
and then try a single search, which will create a new IndexReader).

I have also noticed that I only get this issue if the search includes the
most recent date added to the application. So given this example:
- data from 20160501 to 20160510 read from database on startup
- data for 20160511 and 20160512 received from message queue
A search for date:[20160501 TO 20160511] = no problem
A search for date:[20160501 TO 20160512] = problems

Any ideas on what I am doing wrong? I only started using Lucene a few weeks
ago so all I have so far is from reading the API docs and various online
examples.

Regards,
Conny
