Seeing "www.fubar.com" in the index means that your analyzer returns it as a
single token. To strip out "www" and "com", you have to use an analyzer that
returns tokens as "www", "fubar" and " com".
Try to use a different analyzer( or write your own as below ).
//a C# example
public class LetterOrDigitAnalyzer : Analyzer
{
public override TokenStream TokenStream(string fieldName,
System.IO.TextReader reader)
{
TokenStream t = new LetterOrDigitTokenizer(reader);
t = new LowerCaseFilter(t);
return t;
}
}
public class LetterOrDigitTokenizer : CharTokenizer
{
public LetterOrDigitTokenizer(TextReader input) : base(input)
{
}
protected override bool IsTokenChar(char c)
{
return char.IsLetterOrDigit(c);
}
}
DIGY
-----Original Message-----
From: jchang [mailto:jchangkihatest@gmail.com]
Sent: Tuesday, February 02, 2010 11:16 PM
To: java-user@lucene.apache.org
Subject: Re: Can't get tokenization/stop works working
I am using org.apache.lucene.analysis.snowball.SnowballAnalyzer.
Looking through luke, I see that www.fubar.com was indexed, not fubar. So,
clearly, I'm not stripping out the stop words of www and com. Any ideas?
--
View this message in context:
http://old.nabble.com/Can%27t-get-tokenization-stop-works-working-tp27400546
p27427519.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
|