lucene-java-user mailing list archives

From "Digy" <>
Subject RE: Can't get tokenization/stop works working
Date Tue, 02 Feb 2010 21:46:37 GMT
Seeing "" in the index means that your analyzer returns it as a
single token. A stop filter only compares whole tokens against the stop list,
so to strip out "www" and "com" you have to use an analyzer that returns
"www", "fubar" and "com" as separate tokens.

Try a different analyzer (or write your own, as below).


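    // A C# (Lucene.Net) example
    using System.IO;
    using Lucene.Net.Analysis;

    // Analyzer that splits on non-alphanumeric characters and lowercases tokens.
    public class LetterOrDigitAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            TokenStream t = new LetterOrDigitTokenizer(reader);
            t = new LowerCaseFilter(t);
            return t;
        }
    }

    // Tokenizer that treats every run of letters or digits as one token.
    public class LetterOrDigitTokenizer : CharTokenizer
    {
        public LetterOrDigitTokenizer(TextReader input) : base(input)
        {
        }

        protected override bool IsTokenChar(char c)
        {
            return char.IsLetterOrDigit(c);
        }
    }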
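
To see why the tokenization step matters for stop words, here is a minimal plain-Java sketch with no Lucene dependency. It only imitates the idea of splitting on non-alphanumeric characters, lowercasing, and then dropping stop words. The class name `TokenizeSketch`, the helpers `tokenize` and `removeStopWords`, and the sample input `www.fubar.com` are assumptions made for illustration, not Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenizeSketch {

    // Split on any character that is not a letter or digit and lowercase
    // each token (the same idea as a letter-or-digit tokenizer followed
    // by a lowercase filter).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A separator ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Keep only tokens that are not in the stop list (whole-token match).
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("www", "com"));
        String input = "www.fubar.com"; // assumed sample value

        // Treated as a single token, nothing matches the stop list.
        System.out.println(removeStopWords(Arrays.asList(input), stopWords));

        // Split into tokens first, the stop words can be removed.
        System.out.println(removeStopWords(tokenize(input), stopWords));
    }
}
```

In this sketch the first call leaves the whole string in place, while the second leaves only `fubar`, which is the behaviour you want from the analyzer chain before the stop filter runs.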







-----Original Message-----
From: jchang [] 
Sent: Tuesday, February 02, 2010 11:16 PM
Subject: Re: Can't get tokenization/stop works working



I am using org.apache.lucene.analysis.snowball.SnowballAnalyzer.

Looking through Luke, I see that was indexed, not fubar. So, clearly, I'm
not stripping out the stop words www and com. Any ideas?



