lucene-java-user mailing list archives

From Florian Hopf <mailingli...@florian-hopf.de>
Subject Re: Understanding performance characteristics of the new point types
Date Wed, 02 Nov 2016 19:19:17 GMT
Thank you both for the explanation, we will switch to StringField with a
TermQuery instead.
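
For readers following this thread in the archive, here is a toy model of the cost difference Mike describes in the quoted reply. It is not Lucene code and uses no Lucene API: sorted `int[]` arrays stand in for postings lists, and the class name `ConjunctionSketch`, the methods `bitsetFirst` and `sparseLeads`, and the `work` counter are all made up for illustration. The first method mimics a query that has to build a bitset of every document with the given type before it can intersect; the second lets the selective clause lead and only probes the type postings.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.stream.IntStream;

// Toy model only: sorted int[] arrays stand in for postings lists.
class ConjunctionSketch {

    // Stand-in for the "up front bitset" strategy: every doc of the type
    // is visited once to fill the bitset, no matter how selective the
    // other clause is. work[0] counts the docs visited while building.
    static int[] bitsetFirst(int[] typePostings, int[] sparsePostings,
                             int maxDoc, long[] work) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : typePostings) {
            bits.set(doc);
            work[0]++;
        }
        int[] hits = new int[sparsePostings.length];
        int n = 0;
        for (int doc : sparsePostings) {
            if (bits.get(doc)) {
                hits[n++] = doc;
            }
        }
        return Arrays.copyOf(hits, n);
    }

    // Stand-in for a conjunction led by the selective clause: the type
    // postings are only probed (binary search here, as a crude analogue
    // of advancing an iterator). work[0] counts the probes.
    static int[] sparseLeads(int[] typePostings, int[] sparsePostings,
                             long[] work) {
        int[] hits = new int[sparsePostings.length];
        int n = 0;
        for (int doc : sparsePostings) {
            work[0]++;
            if (Arrays.binarySearch(typePostings, doc) >= 0) {
                hits[n++] = doc;
            }
        }
        return Arrays.copyOf(hits, n);
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        // Assumed data: all docs except every 100th have the wanted type.
        int[] typePostings = IntStream.range(0, maxDoc)
                .filter(d -> d % 100 != 0)
                .toArray();
        // Assumed selective clause matching 5 documents.
        int[] sparsePostings = {7, 100, 123_456, 500_000, 999_999};

        long[] workBitset = new long[1];
        long[] workSparse = new long[1];
        int[] a = bitsetFirst(typePostings, sparsePostings, maxDoc, workBitset);
        int[] b = sparseLeads(typePostings, sparsePostings, workSparse);

        System.out.println("bitset-first hits: " + Arrays.toString(a)
                + ", docs visited: " + workBitset[0]);
        System.out.println("sparse-leads hits: " + Arrays.toString(b)
                + ", probes: " + workSparse[0]);
    }
}
```

Both strategies return the same hits; only the amount of work differs. In this model the bitset strategy visits every document of the type, while the sparse-led strategy does one probe per document of the selective clause. That matches the direction, though not necessarily the magnitude, of the MUST slowdown reported in the original question.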

On 02.11.2016 20:09, Michael McCandless wrote:
> Yeah it's best to use StringField for low-cardinality use cases.
> 
> When cardinality is low (4 unique values in your case), legacy
> numerics would rewrite to a BooleanQuery, which is much more
> performant for MUST clauses, vs dimensional points which will always
> need to construct an up front bitset for all documents with that
> value.  Using StringField instead will ensure you always get a
> BooleanQuery...
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Wed, Nov 2, 2016 at 2:43 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>> Hi Florian,
>>
>> If my understanding is correct, you are using IntPoint to index 4 different
>> document types, which is overkill; why not try a classic “non-tokenized”
>> keyword field (a.k.a. “legacy string”) for document types? Cardinality is
>> only four for document types.
>>
>>
>> --
>>
>> Fuad Efendi
>>
>> (416) 993-2060
>>
>> http://www.tokenizer.ca
>> Recommender Systems
>>
>>
>> On November 2, 2016 at 2:10:14 PM, Florian Hopf (
>> mailinglists@florian-hopf.de) wrote:
>>
>> Hi,
>>
>> we are indexing different types of documents in one Lucene index. They
>> have most fields in common but we need to filter some types for certain
>> queries. We are using numeric values to determine the types of documents
>> (1-4). Now, when querying these documents we see that the performance
>> degrades the more documents of a type are in the index.
>>
>> Using a simple test that indexes 10 million documents I can see the
>> following when filtering on everything but 100000 documents:
>>
>> * When issuing the query alone the new PointRangeQuery
>> (IntPoint.newExactQuery) is a lot faster than term and legacy numeric
>> (in my case around 2x the speed of the others)
>> * When issuing a bool query that contains a term query that selects 5
>> documents together with a MUST query that selects on the numeric, the
>> points are 5x slower than legacy numeric
>> (LegacyNumericRangeQuery.newIntRange) and terms (TermQuery)
>> * When doing the same thing with SHOULD instead of MUST for the
>> additional term query, the PointRangeQuery is the fastest as well
>>
>> I suspect this to be related to the discussion in
>> https://issues.apache.org/jira/browse/LUCENE-7254
>>
>> Of course there could be something wrong with the way I am measuring the
>> performance; I'd be happy to share the code. But what I read in the
>> ticket above seems to hint that the points are not suited for every use
>> case? Is it recommended to use StringField in a case like this instead?
>>
>> Regards
>> Florian
>>
>> --
>> Florian Hopf
>> Freelance Software Developer
>>
>> http://blog.florian-hopf.de
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
> 


-- 
Florian Hopf
Freelance Software Developer

http://blog.florian-hopf.de


