lucene-general mailing list archives

From "Steven A Rowe" <sar...@syr.edu>
Subject RE: Searching chomps my terms..
Date Tue, 11 Mar 2008 15:10:31 GMT
On 03/11/2008 at 8:46 AM, André Warnier wrote:
> João Rodrigues wrote:
> > @André:
> > 
> > Even if I use Simple Analyzer, which I think should leave the term
> > "alone", the number gets "eaten".
>
> I'm no expert, so I was just launching that answer to see if it elicited
> more qualified responses. But I found this on Google :
> http://project.iml.umu.se/projects/scam-repository/ticket/2 (seems to
> say also that SimpleAnalyser does not retain numbers, and that you
> should try StandardAnalyser instead).
> 
> (But I must say that precise documentation seems hard to find).

The API docs are at: <http://lucene.apache.org/java/2_3_1/api/>.  Find the class name
you're interested in and follow it where it goes :) .

SimpleAnalyzer is "[a]n Analyzer that filters LetterTokenizer with LowerCaseFilter":

<http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/SimpleAnalyzer.html>

LetterTokenizer's docs say:

   A LetterTokenizer is a tokenizer that divides text at non-letters.
   That's to say, it defines tokens as maximal strings of adjacent
   letters, as defined by java.lang.Character.isLetter() predicate.

<http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/LetterTokenizer.html>

LowerCaseFilter "[n]ormalizes token text to lower case":

<http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/LowerCaseFilter.html>
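Taken together, those two descriptions explain why the number gets "eaten": digits are not letters, so they act as token separators and never reach the index. Here is a rough sketch of that behaviour in plain Java. It does not use Lucene at all; the class SimpleAnalyzerSketch and its tokenize() method are made-up names for illustration, and the logic only mimics what the docs quoted above describe (split on anything that fails Character.isLetter(), then lowercase).

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: NOT Lucene code. Class and method names are hypothetical.
public class SimpleAnalyzerSketch {

    // Mimics "LetterTokenizer filtered by LowerCaseFilter" as the docs describe it:
    // tokens are maximal runs of letters, lowercased; everything else is a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter (digit, space, punctuation) ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [order, shipped, via, ab, cd]
        System.out.println(tokenize("Order 66 shipped via AB123CD"));
    }
}
```

Note that "66" disappears entirely and "AB123CD" is split into "ab" and "cd", which matches the symptom João described.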

Exercise for the reader: find the docs for StandardAnalyzer :) .

Steve
