lucene-java-user mailing list archives

From Scott Smith <ssm...@mainstreamdata.com>
Subject Analyzers aren't reusable?? (lucene 4.2.1)
Date Thu, 05 Dec 2013 20:35:59 GMT
I wrote the following to demonstrate behavior that surprised me (this is Lucene 4.2.1).
You should be able to run it yourself, since it references nothing beyond the standard
Lucene and Java libraries.  Basically, this is an analyzer that lowercases everything and
strips all of the HTML tags.

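package com.somedomain;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

public final class DemoAnalyzer extends StopwordAnalyzerBase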
{
    public DemoAnalyzer()
    {
        super(Version.LUCENE_42);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName,
            Reader reader)
    {
        final Tokenizer source = new StandardTokenizer(Version.LUCENE_42,
                                                new HTMLStripCharFilter(reader));
        TokenStream result = new LowerCaseFilter(Version.LUCENE_42, source);
        return new TokenStreamComponents(source, result);
    }

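    // Debug helper: runs the analyzer over the text and returns the tokens it produced,
    // prefixed with the analyzer's identity so reuse is visible in the output.
    public static String getTokenStream(String a_zText, Analyzer a_zAnalyzer) throws IOException
    {
        TokenStream stream;
        CharTermAttribute attr;
        stream = a_zAnalyzer.tokenStream(null, new StringReader(a_zText));
        attr = stream.getAttribute(CharTermAttribute.class);
        stream.reset();
        StringBuilder sb = new StringBuilder();
        sb.append(a_zAnalyzer.toString());
        sb.append("::");
        while(stream.incrementToken())
        {
            // Always true here, because sb already holds the analyzer name and "::".
            if (sb.length() > 0)
            {
                sb.append(' ');
            }
            sb.append(attr.toString());
        }
        // Finish the consumer workflow (reset, incrementToken, end, close)
        // before the analyzer hands the same stream out again.
        stream.end();
        stream.close();

        return "original String: " + a_zText + "\n" + sb.toString();
    }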


    public static void main(String[] args) throws IOException
    {
        String text = "<p>This is a <b>TEST</b> of the demo analyzer</p>";
        Analyzer a = new DemoAnalyzer();

        System.out.println(getTokenStream(text, a));

        System.out.println(getTokenStream(text, a));

        System.out.println(getTokenStream(text, new DemoAnalyzer()));
    }
}

When I run this, I get the following output:

original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
com.somedomain.DemoAnalyzer@5d3f79f7:: this is a test of the demo analyzer

original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
com.somedomain.DemoAnalyzer@5d3f79f7:: p this is a b test b of the demo analyzer p

original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
com.somedomain.DemoAnalyzer@138532dc:: this is a test of the demo analyzer

The critical line is the second line of each of the 3 pairs.  Note that line in case 2 (of 3):
rather than stripping each HTML tag entirely, it strips only the "<", "/", and ">" characters
and keeps the tag names ("p", "b") as tokens.  Is this expected behavior?  I thought analyzers
were thread-safe and reusable.  Am I wrong on that point?  I would expect the output of all
three to be the same.

Can someone explain to me what's going on?  What am I missing?

Scott
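
One possible explanation, sketched with plain java.io classes rather than Lucene itself.
The sketch assumes that a reused analyzer creates its components only on the first call and,
on later calls, hands the new Reader straight to the cached tokenizer.  Under that assumption
a char filter wrapped around the reader inside createComponents only ever sees the first
input.  All names below (ReuseSketch, ToyAnalyzer, ToyTokenizer, TagStripReader) are
hypothetical stand-ins, and the real Lucene 4.2.1 internals may differ in detail.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ReuseSketch {

    /** Toy char filter (hypothetical): drops everything from '<' through the next '>'. */
    public static final class TagStripReader extends FilterReader {
        private boolean inTag = false;

        public TagStripReader(Reader in) { super(in); }

        @Override
        public int read() throws IOException {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '<') { inTag = true; }
                else if (c == '>' && inTag) { inTag = false; }
                else if (!inTag) { return c; }
            }
            return -1;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = 0;
            while (n < len) {
                int c = read();
                if (c == -1) break;
                buf[off + n++] = (char) c;
            }
            return (n == 0 && len > 0) ? -1 : n;
        }
    }

    /** Toy tokenizer (hypothetical): lower-cases and splits on non-alphanumerics. */
    public static final class ToyTokenizer {
        private Reader input;

        public ToyTokenizer(Reader input) { this.input = input; }

        /** Reuse path: swaps in a new input, dropping whatever wrapped the old one. */
        public void setReader(Reader input) { this.input = input; }

        public String tokens() {
            StringBuilder text = new StringBuilder();
            try {
                for (int c = input.read(); c != -1; c = input.read()) text.append((char) c);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            List<String> out = new ArrayList<>();
            for (String t : text.toString().toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!t.isEmpty()) out.add(t);
            }
            return String.join(" ", out);
        }
    }

    /** Toy analyzer (hypothetical): builds its tokenizer once, then reuses it. */
    public static class ToyAnalyzer {
        private ToyTokenizer cached;

        /** Applied to the reader on every call. */
        protected Reader initReader(Reader reader) { return reader; }

        /** Called on the first call only. */
        protected ToyTokenizer createComponents(Reader reader) { return new ToyTokenizer(reader); }

        public final String analyze(String text) {
            Reader r = initReader(new StringReader(text));
            if (cached == null) { cached = createComponents(r); }
            else { cached.setReader(r); }
            return cached.tokens();
        }
    }

    /** Mirrors the DemoAnalyzer layout: the filter is applied inside createComponents. */
    public static ToyAnalyzer filterInCreateComponents() {
        return new ToyAnalyzer() {
            @Override protected ToyTokenizer createComponents(Reader reader) {
                return new ToyTokenizer(new TagStripReader(reader));
            }
        };
    }

    /** Alternative layout: the filter is applied in the per-call hook. */
    public static ToyAnalyzer filterInInitReader() {
        return new ToyAnalyzer() {
            @Override protected Reader initReader(Reader reader) { return new TagStripReader(reader); }
        };
    }

    public static void main(String[] args) {
        String text = "<p>This is a <b>TEST</b> of the demo analyzer</p>";
        ToyAnalyzer a = filterInCreateComponents();
        System.out.println(a.analyze(text)); // this is a test of the demo analyzer
        System.out.println(a.analyze(text)); // p this is a b test b of the demo analyzer p
        ToyAnalyzer b = filterInInitReader();
        System.out.println(b.analyze(text)); // this is a test of the demo analyzer
        System.out.println(b.analyze(text)); // this is a test of the demo analyzer
    }
}
```

If this model matches what Lucene does, the second call on a reused instance bypasses the
HTML stripping, which would produce exactly the "p ... b ... b ... p" tokens shown above.
In Lucene 4.x, Analyzer.initReader(String, Reader) is the hook intended for wrapping the
reader in a CharFilter, so moving the HTMLStripCharFilter there is worth trying.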
