lucene-java-user mailing list archives

From Scott Smith <ssm...@mainstreamdata.com>
Subject RE: Analyzers aren't reusable?? (lucene 4.2.1)
Date Thu, 05 Dec 2013 20:53:05 GMT
Thanks for the quick response.   I'll read through the references.

Thanks again

Scott

-----Original Message-----
From: Uwe Schindler [mailto:uwe@thetaphi.de] 
Sent: Thursday, December 05, 2013 1:46 PM
To: java-user@lucene.apache.org
Subject: RE: Analyzers aren't reusable?? (lucene 4.2.1)

The problem is the CharFilter, which cannot be reused. To implement the Analyzer correctly,
wrap the incoming Reader in the protected initReader() method:
http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/Analyzer.html#initReader(java.lang.String, java.io.Reader)
In createComponents(), only take the Reader from the parameter and create the
Tokenizer+TokenFilters (which can be reused). createComponents() runs only when the components
are first built; on later calls the cached Tokenizer just receives the new Reader, so a CharFilter
created there is not applied again. initReader(), by contrast, runs on every call to tokenStream(),
so each new Reader is wrapped before it is passed to the reused Tokenizer.
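
To illustrate why the second call in the original post lost the HTML stripping, here is a
minimal, Lucene-free sketch of the reuse mechanism. It is a simplified model under stated
assumptions, not Lucene's actual code: ReuseModel, TagStripReader, Components, ModelAnalyzer,
BrokenAnalyzer and FixedAnalyzer are hypothetical stand-ins, and only the method names
initReader(), createComponents() and tokenStream() mirror the real Analyzer API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Simplified model of Analyzer reuse; every class here is a hypothetical stand-in.
class ReuseModel {

    /** Toy stand-in for a CharFilter: replaces each "<...>" tag with one space. */
    static final class TagStripReader extends Reader {
        private final Reader in;
        private boolean inTag = false;
        TagStripReader(Reader in) { this.in = in; }

        @Override public int read(char[] buf, int off, int len) throws IOException {
            if (len == 0) return 0;
            int n = 0;
            int c;
            while (n < len && (c = in.read()) != -1) {
                if (c == '<') inTag = true;
                else if (c == '>') { inTag = false; buf[off + n++] = ' '; }
                else if (!inTag) buf[off + n++] = (char) c;
            }
            return n == 0 ? -1 : n; // nothing produced means the input is exhausted
        }
        @Override public void close() throws IOException { in.close(); }
    }

    /** Toy stand-in for a reusable Tokenizer + lowercase filter chain. */
    static final class Components {
        private Reader input;
        Components(Reader input) { this.input = input; }
        void setReader(Reader input) { this.input = input; }

        List<String> tokens() throws IOException {
            List<String> out = new ArrayList<>();
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isLetterOrDigit(c)) sb.append(Character.toLowerCase((char) c));
                else if (sb.length() > 0) { out.add(sb.toString()); sb.setLength(0); }
            }
            if (sb.length() > 0) out.add(sb.toString());
            return out;
        }
    }

    /** Models the reuse logic: components are built once, then only given new readers. */
    abstract static class ModelAnalyzer {
        private Components cached; // the real Analyzer caches per thread; one field suffices here

        protected abstract Components createComponents(Reader reader);
        protected Reader initReader(Reader reader) { return reader; }

        final List<String> tokenStream(String text) throws IOException {
            Reader r = initReader(new StringReader(text)); // runs on every call
            if (cached == null) cached = createComponents(r); // runs on the first call only
            else cached.setReader(r);
            return cached.tokens();
        }
    }

    /** Mirrors the original post: the wrapping happens inside createComponents(). */
    static final class BrokenAnalyzer extends ModelAnalyzer {
        @Override protected Components createComponents(Reader reader) {
            return new Components(new TagStripReader(reader));
        }
    }

    /** Mirrors the suggested fix: the wrapping happens inside initReader(). */
    static final class FixedAnalyzer extends ModelAnalyzer {
        @Override protected Reader initReader(Reader reader) { return new TagStripReader(reader); }
        @Override protected Components createComponents(Reader reader) {
            return new Components(reader);
        }
    }

    static String run(ModelAnalyzer a, String text) {
        try { return String.join(" ", a.tokenStream(text)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args) {
        String text = "<p>This is a <b>TEST</b> of the demo analyzer</p>";
        ModelAnalyzer broken = new BrokenAnalyzer();
        System.out.println(run(broken, text)); // this is a test of the demo analyzer
        System.out.println(run(broken, text)); // p this is a b test b of the demo analyzer p
        ModelAnalyzer fixed = new FixedAnalyzer();
        System.out.println(run(fixed, text));  // this is a test of the demo analyzer
        System.out.println(run(fixed, text));  // this is a test of the demo analyzer
    }
}
```

In the real DemoAnalyzer the corresponding change would be to override
initReader(String fieldName, Reader reader) so that it returns new HTMLStripCharFilter(reader),
and to pass the plain reader parameter to StandardTokenizer in createComponents().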

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Scott Smith [mailto:ssmith@mainstreamdata.com]
> Sent: Thursday, December 05, 2013 9:36 PM
> To: java-user@lucene.apache.org
> Subject: Analyzers aren't reusable?? (lucene 4.2.1)
> 
> I wrote the following to demonstrate what for me was surprising 
> behavior (this is Lucene 4.2.1).  If you want to run this yourself, 
> you should be able to, since there are no references to anything other 
> than standard Lucene and Java libraries.  Basically, this is an 
> analyzer that makes everything lowercase and strips all of the HTML tags.
> 
> public final class DemoAnalyzer extends StopwordAnalyzerBase {
>     public DemoAnalyzer()
>     {
>         super(Version.LUCENE_42);
>     }
> 
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName,
>             Reader reader)
>     {
>         final Tokenizer source = new StandardTokenizer(Version.LUCENE_42,
>                                                 new HTMLStripCharFilter(reader));
>         TokenStream result = new LowerCaseFilter(Version.LUCENE_42, source);
>         return new TokenStreamComponents(source, result);
>     }
> 
>     // this is just a debug routine to display some results.
>     public static String getTokenStream(String a_zText, Analyzer a_zAnalyzer)
>             throws IOException
>     {
>         TokenStream stream;
>         CharTermAttribute attr;
>         stream = a_zAnalyzer.tokenStream(null, new StringReader(a_zText));
>         stream.reset();
>         StringBuffer sb = new StringBuffer();
>         sb.append(a_zAnalyzer.toString());
>         sb.append("::");
>         while(stream.incrementToken())
>         {
>             attr = stream.getAttribute(CharTermAttribute.class);
>             if (sb.length() > 0)
>             {
>                 sb.append(' ');
>             }
>             sb.append(attr.toString());
>         }
> 
>         return "original String: " + a_zText + "\n" + sb.toString();
>     }
> 
> 
>     public static void main(String[] args) throws IOException
>     {
>         String text = "<p>This is a <b>TEST</b> of the demo analyzer</p>";
>         Analyzer a = new DemoAnalyzer();
> 
>         System.out.println(getTokenStream(text, a));
> 
>         System.out.println(getTokenStream(text, a));
> 
>         System.out.println(getTokenStream(text, new DemoAnalyzer()));
>     }
> }
> 
> When I run this, I get the following output:
> 
> original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
> com.somedomain.DemoAnalyzer@5d3f79f7:: this is a test of the demo analyzer
> 
> original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
> com.somedomain.DemoAnalyzer@5d3f79f7:: p this is a b test b of the demo analyzer p
> 
> original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
> com.somedomain.DemoAnalyzer@138532dc:: this is a test of the demo analyzer
> 
> The critical line is the second of each of the 3 pairs.  Note the line 
> in case 2 (of 3).  Rather than stripping the entire html tag, it's just 
> stripping the "<" and "/>".  Is this expected behavior?  I thought 
> analyzers were thread-safe and reusable.  Am I wrong on that point?  
> I would expect the output of all three to be the same.
> 
> Can someone explain to me what's going on?  What am I missing?
> 
> Scott


