lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: CJKBigramFilter - position bug with outputUnigrams?
Date Sun, 20 Apr 2014 11:21:56 GMT
There is no bug here. The positions are correct.

If you want to use phrase queries, I wouldn't try to be so tricky with n-grams.

This never works well, and there is nothing to fix...

On Sun, Apr 20, 2014 at 2:02 AM, Shawn Heisey <solr@elyograg.org> wrote:
> The analysis chain on some of my Solr fieldType entries includes
> CJKBigramFilterFactory on both the index and query sides.  I had
> outputUnigrams enabled on the index side, but had it disabled on the
> query side.  This resulted in a problem with phrase queries.  This is a
> subset of the index analysis for the three terms you can see in the
> ICUNF step, separated by spaces.  One word has been replaced with
> 'redacted' ... it's in Latin script and there's nothing unusual about it:
>
> https://www.dropbox.com/s/9q1x9pdbsjhzocg/bigram-position-problem.png
>
> Note that in the CJKBF step, the second unigram is output at position 2,
> pushing the English terms to 3 and 4.
>
> Imagine that the customer is doing a phrase search.  What ends up
> getting sent to Solr is a filter query like this:
>
> field:"綾瀬 haruka"
>
> The query analysis on this, which doesn't output unigrams, has "haruka"
> at position 2.  As already shown, the index analysis puts "haruka" at
> position 3.  The query doesn't match, because it's a phrase query and
> has no phrase slop.
>
> I would have expected both unigrams to be at position 1.  To me, it's a
> bug ... or at least something that I should be able to configure on the
> filter.
>
> If this gets sent via the main query (edismax), it all works, because I
> have phrase slop enabled by default.
>
> The customer does not like what happens when the index and query
> analyzers match, either with or without outputUnigrams.  When
> outputUnigrams is completely disabled, searching for a single character
> doesn't match multi-character strings, and when it is enabled on both,
> they get matches they did not want.
>
> I've already been pointed at an awesome blog series, which will
> hopefully help me improve things, but I think that the customer will
> still want outputUnigrams disabled on the query side, so I still have
> this problem.
>
> http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
>
> If I file an issue, should it be bug or improvement?
>
> Thanks,
> Shawn
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
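
The position mismatch described in the quoted message can be reproduced with a small model. This is only a sketch under stated assumptions, not Lucene's CJKBigramFilter: the functions `is_han`, `analyze`, and `phrase_matches` are hypothetical names, input is split on whitespace, only Han ideographs are treated as CJK, and positions are 1-based as in the Solr analysis screen. It assigns positions the way the message reports them (each unigram advances the position, and a bigram shares the position of its first character), then checks a phrase with no slop.

```python
import unicodedata


def is_han(ch):
    # Assumption: only Han ideographs count as CJK in this simplified model.
    return "CJK UNIFIED IDEOGRAPH" in unicodedata.name(ch, "")


def analyze(text, output_unigrams):
    """Return (term, position) pairs for a whitespace-separated string."""
    tokens = []
    pos = 0
    for word in text.split():
        if not all(is_han(c) for c in word) or len(word) == 1:
            # Non-CJK words and lone CJK characters pass through as one token.
            pos += 1
            tokens.append((word.lower(), pos))
        elif output_unigrams:
            # Each unigram advances the position; the bigram that starts
            # at that character is stacked on the same position.
            for i, ch in enumerate(word):
                pos += 1
                tokens.append((ch, pos))
                if i + 1 < len(word):
                    tokens.append((ch + word[i + 1], pos))
        else:
            # Bigrams only: one position per bigram.
            for i in range(len(word) - 1):
                pos += 1
                tokens.append((word[i:i + 2], pos))
    return tokens


def phrase_matches(index_tokens, query_tokens):
    """Exact phrase match (no slop): every query term must occur in the
    index at the same relative offset it has in the query."""
    positions = {}
    for term, pos in index_tokens:
        positions.setdefault(term, set()).add(pos)
    first_term, first_pos = query_tokens[0]
    for start in positions.get(first_term, ()):
        if all(start + (qpos - first_pos) in positions.get(qterm, ())
               for qterm, qpos in query_tokens):
            return True
    return False


indexed = analyze("綾瀬 haruka", output_unigrams=True)
queried = analyze("綾瀬 haruka", output_unigrams=False)
print(indexed)
print(queried)
print(phrase_matches(indexed, queried))
```

In this model the indexed side places "haruka" at position 3 while the query side places it at position 2, so the exact phrase check fails. When both sides use the same outputUnigrams setting the relative offsets agree.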
