lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Lucene TermsFilter lookup slow
Date Mon, 10 Aug 2015 12:46:35 GMT
OK, indeed, that version has the changes I was thinking of,
specifically optimizing the case when only a single doc contains a
term by inlining that docID into the terms dict.

You should be able to improve on TermsFilter a bit because you know
only 1 doc matches each ID, so once one segment has found a given ID
you can stop testing that ID against the remaining segments.  Also,
since you are doing bulk lookup, you should pre-sort the IDs so that
each segment's lookup is a sequential scan through the terms dict.
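
A minimal sketch of those two ideas, using plain Java collections as a
stand-in for the index rather than real Lucene calls. The class
PkLookupSketch, the Segment type, and the lookup method are all
hypothetical names invented for this illustration: each "segment" is
modelled as a sorted map from uid to a segment-local doc ID plus a doc
base. In Lucene 4.x the corresponding pieces would presumably be the
per-segment readers from IndexReader.leaves() and a TermsEnum probed
with seekExact, but that wiring is not shown here.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeSet;

public class PkLookupSketch {

    /** Hypothetical stand-in for one index segment. */
    public static final class Segment {
        final int docBase;                                // offset of this segment's docs
        final NavigableMap<String, Integer> termsDict;    // uid -> segment-local doc ID

        public Segment(int docBase, NavigableMap<String, Integer> termsDict) {
            this.docBase = docBase;
            this.termsDict = termsDict;
        }
    }

    /** Returns uid -> global doc ID for every requested uid that exists. */
    public static Map<String, Integer> lookup(List<Segment> segments,
                                              Collection<String> ids) {
        // Pre-sort (and de-duplicate) the IDs so each segment is probed in
        // ascending term order. Java's String order matches the index's
        // byte order only for ASCII IDs, which is assumed here.
        TreeSet<String> pending = new TreeSet<>(ids);
        Map<String, Integer> found = new HashMap<>();

        for (Segment segment : segments) {
            if (pending.isEmpty()) {
                break; // every ID already resolved, skip remaining segments
            }
            Iterator<String> it = pending.iterator();
            while (it.hasNext()) {
                String id = it.next();
                Integer localDoc = segment.termsDict.get(id);
                if (localDoc != null) {
                    found.put(id, segment.docBase + localDoc);
                    // Unique key: at most one doc matches, so never test
                    // this ID against later segments.
                    it.remove();
                }
            }
        }
        return found;
    }
}
```

Because resolved IDs are removed from the pending set, the work per
segment shrinks as matches are found, and IDs that exist nowhere are
simply absent from the returned map.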

There is another thread right now, subject "Mapping doc values back to
doc ID (in decent time)", also talking about how to do faster PK
lookups.

Mike McCandless

http://blog.mikemccandless.com

On Sun, Aug 9, 2015 at 3:17 AM, jamie <jamie@stimulussoft.com> wrote:
> Mike
>
> Thank you kindly for the reply. I am using Lucene v4.10.4. Are the
> optimizations you refer to available in this version?
>
> We haven't yet upgraded to Lucene 5 as there appear to be many API changes.
>
> Jamie
>
>
> On 2015/08/08 5:13 PM, Michael McCandless wrote:
>>
>> Which version of Lucene are you using?  Newer versions have optimized
>> the "primary key" use case somewhat...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Sat, Aug 8, 2015 at 8:32 AM, jamie <jamie@stimulussoft.com> wrote:
>>>
>>> Greetings
>>>
>>> Our app primarily uses Lucene for its intended purpose, i.e. to search
>>> across large amounts of unstructured text. However, our requirements
>>> recently expanded to include look-ups of specific documents in the index
>>> based on associated custom-defined unique keys. For our purposes, a unique
>>> key is the string representation of a 128-bit murmur hash, stored in a
>>> Lucene field named uid. We are currently using the TermsFilter to look up
>>> documents in the Lucene index as follows:
>>>
>>> List<Term> terms = new LinkedList<>();
>>> for (String id : ids) {
>>>     terms.add(new Term("uid", id));
>>> }
>>> TermsFilter idFilter = new TermsFilter(terms);
>>> ... search logic...
>>>
>>> At any time we may need to look up, say, a couple of thousand documents.
>>> Our problem is one of performance. On very large indexes with 30 million
>>> records or more, the lookup can be excruciatingly slow. At this stage, it's
>>> not practical for us to move the data over to a fit-for-purpose database,
>>> nor to change the uid field to a numeric type. I fully appreciate that
>>> Lucene is not designed to be a database; however, is there anything we
>>> can do to improve the performance of these look-ups?
>>>
>>> Much appreciated
>>>
>>> Jamie
>>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
