lucene-dev mailing list archives

From Shawn Heisey
Subject Re: CJKBigramFilter - position bug with outputUnigrams?
Date Fri, 02 May 2014 03:44:35 GMT
On 4/21/2014 12:47 PM, Robert Muir wrote:
> I think you misunderstand what the filter does. It does not "output unigrams".
> In the case you choose this option, the positions are from the
> unigrams emitted by your tokenizer (StandardTokenizer or whatever),
> and it just adds bigrams as synonyms to those. It cannot safely do
> anything else.
> There can be only one "n".

I took a quick look at the code.  I'm sure it's easy to grasp once
you're really familiar with everything, but I'm having a hard time
decoding exactly how the filter works.  I don't have any more time to
plow through it tonight.

Would it be possible to implement an option with a name similar to
"lastUnigramAtPreviousPosition" so that I can optionally get the
behavior I'm after when the input is two or more characters, without
changing current behavior for anyone else?  This would completely solve
my current problem.
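
To make the request concrete, here is a rough sketch of the positions
involved. It is a toy model in plain Java, not Lucene code: the class
BigramPositionSketch and the method analyze are made up for illustration,
every code point is treated as a CJK character, and the boolean flag only
models the proposed "lastUnigramAtPreviousPosition" option, which does not
exist in the filter. The default branch follows the behavior described in
the quoted reply: each unigram takes the next position and each bigram is
stacked on the position of its first character.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramPositionSketch {

    /**
     * Returns tokens as "term@position" strings.
     * Toy model only: all code points are assumed to be CJK, and
     * lastUnigramAtPreviousPosition is a hypothetical option.
     */
    static List<String> analyze(String text, boolean lastUnigramAtPreviousPosition) {
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        int pos = -1; // the first unigram advances this to position 0
        for (int i = 0; i < cps.length; i++) {
            boolean last = (i == cps.length - 1);
            // Current behavior: every unigram advances the position by 1.
            // Hypothetical option: the final unigram of a run of two or
            // more characters stays on the previous position instead.
            boolean stack = lastUnigramAtPreviousPosition && last && cps.length >= 2;
            if (!stack) {
                pos++;
            }
            out.add(new String(cps, i, 1) + "@" + pos);
            if (!last) {
                // The bigram is added at the same position as its first unigram.
                out.add(new String(cps, i, 2) + "@" + pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("ABC", false)); // [A@0, AB@0, B@1, BC@1, C@2]
        System.out.println(analyze("ABC", true));  // [A@0, AB@0, B@1, BC@1, C@1]
    }
}
```

With the flag off, the trailing unigram sits alone at its own position;
with it on, that unigram lands on the same position as the last bigram.
A single-character input is unaffected either way.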

