lucene-java-user mailing list archives

From "Karsten F." <karsten-s...@gmx.de>
Subject Re: Under the hood of SpanQueries
Date Thu, 11 Apr 2013 18:22:02 GMT
Hi Igor,
About your performance problem with SpanQueries and payloads:
Try filtering with the corresponding BooleanQuery, and use a profiler.
You have an I/O bottleneck because position and payload information 
is read for every document.
It may help to first filter out the "obvious" non-hits.
Documents that do not contain all the search terms of the SpanQuery 
are "obviously" not hits.
So we do not need the term positions for documents that fail to match all 
search terms even before position information is considered. But SpanQueries 
read this information even for these "obvious" non-hits.
SpanQueries, like PhraseQueries, are implicitly BooleanQueries in which 
every clause is required (MUST).
But SpanQueries read the position information of each term directly for 
each document
(PhraseQueries first check that all terms occur in the document).
So it could help to make the implicit BooleanQuery explicit: first 
collect the hits of the BooleanQuery, and then run the SpanQuery 
only inside this collection (use the DocIdSet as a Filter).
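
A minimal sketch of this two-phase idea, using a toy in-memory positional 
index instead of the Lucene API. All names here (TwoPhaseSpanSketch, 
candidates, near, positionReads) are hypothetical and only illustrate the 
principle: phase 1 intersects document ids without touching positions, and 
phase 2 reads positions only for the surviving candidates. In Lucene 4.x the 
first phase would presumably correspond to something like a 
QueryWrapperFilter around the BooleanQuery, but that mapping is an 
assumption and is not shown.

```java
import java.util.*;

/** Toy positional index; an illustration only, NOT the Lucene API. */
class TwoPhaseSpanSketch {
    // term -> (docId -> positions of that term inside the document)
    final Map<String, Map<Integer, int[]>> postings = new HashMap<>();
    // Counts position-list fetches; stands in for the expensive I/O.
    int positionReads = 0;

    /** Indexes one document; token i gets position i. */
    void add(int docId, String... tokens) {
        Map<String, List<Integer>> byTerm = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            byTerm.computeIfAbsent(tokens[i], t -> new ArrayList<>()).add(i);
        }
        for (Map.Entry<String, List<Integer>> e : byTerm.entrySet()) {
            int[] pos = e.getValue().stream().mapToInt(Integer::intValue).toArray();
            postings.computeIfAbsent(e.getKey(), t -> new TreeMap<>()).put(docId, pos);
        }
    }

    /** Phase 1: ids of documents containing every term (no positions read). */
    Set<Integer> candidates(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings
                .getOrDefault(term, Collections.<Integer, int[]>emptyMap())
                .keySet();
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    /** Fetches one position list and records the read. */
    int[] positions(String term, int docId) {
        positionReads++;
        return postings
            .getOrDefault(term, Collections.<Integer, int[]>emptyMap())
            .getOrDefault(docId, new int[0]);
    }

    /** Phase 2: true if `second` follows `first` with at most `slop` tokens between. */
    boolean near(int docId, String first, String second, int slop) {
        int[] a = positions(first, docId);
        int[] b = positions(second, docId);
        for (int p : a) {
            for (int q : b) {
                if (q > p && q - p - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Runs the position check only on the phase-1 candidates. */
    List<Integer> search(String first, String second, int slop) {
        List<Integer> hits = new ArrayList<>();
        for (int docId : candidates(first, second)) {
            if (near(docId, first, second, slop)) {
                hits.add(docId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TwoPhaseSpanSketch idx = new TwoPhaseSpanSketch();
        idx.add(0, "quick", "brown", "fox");
        idx.add(1, "quick", "dog");
        idx.add(2, "fox", "is", "quick");
        idx.add(3, "lazy", "fox");
        System.out.println("hits = " + idx.search("quick", "fox", 1));
        System.out.println("position reads = " + idx.positionReads);
    }
}
```

In this toy example only documents 0 and 2 contain both terms, so positions 
are fetched for two of the four documents. On a real index the saving 
depends on how selective the BooleanQuery is; if most documents contain all 
terms, the filter removes little.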
If this does not help, use a profiler and ask again ;-)
Best regards,
Karsten

ps. in context: 
http://lucene.472066.n3.nabble.com/Under-the-hood-of-SpanQueries-td4053638.html

On 04/03/2013 11:55 PM, Igor Shalyminov wrote:
> Hi all!
>
> I have a ~20GB index of documents that have words with several
> attributes associated with them, e.g.:
>
> WORD: word_1 word_2 ... word_n
> POS:    pos1_1:pos1_2:pos1_3 pos2 ... pos_n_1:pos_n_2
> LEMMA: lemma1_1:lemma1_2:lemma1_3 lemma2 lemma_n_1:lemma_n_2
>
> Field tokens separated by ':' are ambiguous, i.e. they correspond to the
> same position in the document.
> An important detail of ambiguous word attributes is that, e.g., pos1_1
> corresponds only to lemma1_1, not to lemma1_2 or 1_3, so one must not
> match word_1 when searching for pos1_1 & lemma1_3 at the same position.
>
> I handle ambiguous token positions with the standard positionIncrement = 0,
> and attribute number correspondence with token payloads. Say, lemma1_1 has
> payload = 1, lemma1_2 - 2; pos1_1 - 1, pos1_2 - 2, and so on. And while
> searching for token attributes at the same position I use a payload filter
> that checks whether the payloads of all matched tokens are the same.
>
> And that's it: SpanNearQueries run super slow on that index (tens of
> seconds, and the majority of indexed documents match a common query).
> I don't actually know how SpanQueries work in depth, but is there some
> inefficiency in them by design? Or is payload retrieval so expensive?
> I'm just wondering if I'm missing something obvious that slows down the entire search.
>

