lucene-java-user mailing list archives

From Ahmet Arslan
Subject ShingleFilter with outputUnigrams=false
Date Sat, 02 Jan 2010 11:57:44 GMT
I am using Lucene 2.9.1. While trying to understand ShingleFilter, I wrote the following code:

String test = "please divide this sentence";
Tokenizer wsTokenizer = new WhitespaceTokenizer(new StringReader(test));
ShingleFilter filter = new ShingleFilter(wsTokenizer, 3);
TermAttribute termAtt = (TermAttribute) filter.getAttribute(TermAttribute.class);

while (filter.incrementToken()) {
    System.out.println(termAtt.term());
}

I noticed that if I set outputUnigrams to false, I get the same output for maxShingleSize=2
and maxShingleSize=3:

please divide 
divide this 
this sentence 

When I set maxShingleSize to 4, the output is:

please divide 
please divide this sentence 
divide this 
this sentence 

I was expecting the following output with maxShingleSize=3 and outputUnigrams=false:

please divide this 
divide this sentence 
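
To make that expectation concrete, here is a minimal plain-Java sketch that does not use Lucene at all. The class ShingleSketch and its shingles(text, minSize, maxSize) helper are made-up names for illustration only, not part of any Lucene API, and this is not a claim about how ShingleFilter is implemented. It simply emits every word n-gram whose size lies between minSize and maxSize, grouped by start position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not Lucene code.
class ShingleSketch {

    // Returns the word n-grams of `text` with minSize <= n <= maxSize,
    // grouped by start position, shortest first.
    // Unigrams are included only when minSize is 1.
    static List<String> shingles(String text, int minSize, int maxSize) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<String>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            // Grow the n-gram one word at a time from this start position.
            for (int size = 1; size <= maxSize && start + size <= words.length; size++) {
                if (size > 1) {
                    sb.append(' ');
                }
                sb.append(words[start + size - 1]);
                if (size >= minSize) {
                    out.add(sb.toString());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String test = "please divide this sentence";
        // Only trigrams: the output expected in this message.
        for (String s : shingles(test, 3, 3)) {
            System.out.println(s);
        }
    }
}
```

Calling shingles(test, 3, 3) yields only the two trigrams listed above, while shingles(test, 2, 3) would yield the bigrams and trigrams together. Neither matches what I actually get from ShingleFilter with maxShingleSize=3.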

Am I missing something, or is this the expected behavior?

I checked the source code of ShingleFilterTest (Lucene 3.0.0) and saw that TRI_GRAM_TOKENS are
tested only with outputUnigrams=true, not with outputUnigrams=false.

