lucene-java-user mailing list archives

From John Byrne <>
Subject bigram analysis
Date Mon, 03 Mar 2008 10:40:32 GMT

I need to use stop-word bigrams, like the Nutch analyzer, as described 
in LIA 4.8 (Nutch Analysis). What I don't understand is, why does it 
keep the original stop word intact? I can see great advantage to being 
able to search for a combination of stop word + real word, but I don't 
see the point of keeping the stop word as a token on its own. Searches 
with just that word would be as pointless as ever.

Is the idea to allow searching on all stop words, even on their own, and 
the bigrams are just an optimization that will improve things 90% of the 
time? Or is it just a side effect of the bigram analyzer that it 
produces a token from the stop word, and therefore it could just be 
filtered out by a stop word filter afterwards, leaving only the bigram 
and the original (non-stop) word?
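
To make the two alternatives concrete, here is a minimal Python sketch of the token streams I have in mind. It is not Nutch's or Lucene's actual code: the function name, the "-" separator, and the one-word stop list are all assumptions for illustration, and it only forms bigrams from a stop word plus the word that follows it.

```python
def stop_bigram_tokens(words, stop_words, keep_stop_words=True):
    """Emit tokens for a word list, adding a bigram for each stop word
    that is followed by another word.

    Hypothetical helper for illustration; not a Lucene or Nutch API.
    """
    tokens = []
    for i, word in enumerate(words):
        if word in stop_words:
            if keep_stop_words:
                # Alternative 1: the stop word stays searchable on its own.
                tokens.append(word)
            if i + 1 < len(words):
                # The "-" separator is an assumption, not a confirmed format.
                tokens.append(word + "-" + words[i + 1])
        else:
            tokens.append(word)
    return tokens


words = ["the", "quick", "brown", "fox"]
stops = {"the"}  # illustrative stop list

with_stops = stop_bigram_tokens(words, stops, keep_stop_words=True)
without_stops = stop_bigram_tokens(words, stops, keep_stop_words=False)
print(with_stops)     # ['the', 'the-quick', 'quick', 'brown', 'fox']
print(without_stops)  # ['the-quick', 'quick', 'brown', 'fox']
```

The second call corresponds to running a stop word filter after the bigram step: the bare "the" disappears, while "the-quick" and the ordinary words remain.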

I'm sure either way would work for me - just wondering what is normally 
done, and if I'm missing something important here...

