lucene-general mailing list archives

From André Warnier ...@ice-sa.com>
Subject Re: Searching chomps my terms..
Date Tue, 11 Mar 2008 15:40:51 GMT


Steven A Rowe wrote:
> On 03/11/2008 at 8:46 AM, André Warnier wrote:
>> João Rodrigues wrote:
>>> @André:
>>>
>>> Even if I use Simple Analyzer, which I think should leave the term
>>> "alone", the number gets "eaten".
>> I'm no expert, so I was just launching that answer to see if it elicited
>> more qualified responses. But I found this on Google :
>> http://project.iml.umu.se/projects/scam-repository/ticket/2 (seems to
>> say also that SimpleAnalyzer does not retain numbers, and that you
>> should try StandardAnalyzer instead).
>>
>> (But I must say that precise documentation seems hard to find).
> 
> The API docs are at: <http://lucene.apache.org/java/2_3_1/api/>.  Find the class
> name you're interested in and follow it where it goes :) .
> 
> SimpleAnalyzer is "[a]n Analyzer that filters LetterTokenizer with LowerCaseFilter":
> 
> <http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/SimpleAnalyzer.html>
> 
> LetterTokenizer's docs say:
> 
>    A LetterTokenizer is a tokenizer that divides text at non-letters.
>    That's to say, it defines tokens as maximal strings of adjacent
>    letters, as defined by java.lang.Character.isLetter() predicate.
> 
> <http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/LetterTokenizer.html>
> 
> LowerCaseFilter "[n]ormalizes token text to lower case":
> 
> <http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/LowerCaseFilter.html>
> 
> Exercise for the reader: find the docs for StandardAnalyzer :) .

http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/standard/StandardAnalyzer.html
Clever, that.

Thanks for the information above.  Since I am myself trying to learn 
Lucene, this discussion comes in handy.

In other words, SimpleAnalyzer is also not the right tool for indexing 
acronyms such as "P2P" or "W3C": LetterTokenizer divides text at 
non-letters, so the digit splits the term and is itself dropped.
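
To convince myself, here is a minimal sketch in plain Java, with no Lucene 
dependency.  It only imitates what the quoted docs describe (maximal runs 
of characters for which Character.isLetter() is true, then lower-cased); 
the class and method names are made up for illustration, and this is not 
Lucene's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for "LetterTokenizer filtered with LowerCaseFilter".
// It imitates the documented behaviour only; it is not Lucene's implementation.
public class LetterTokenSketch {

    // Split text into maximal runs of letters, lower-casing each run.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // Letters extend the current token.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any non-letter (digit, space, punctuation) ends the token
                // and is itself discarded.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [p, p, and, w, c]: the digits in "P2P" and "W3C" are gone.
        System.out.println(tokenize("P2P and W3C"));
    }
}
```

So under these assumptions "P2P" is indexed as two separate "p" tokens, 
which matches João's observation that the number gets "eaten".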


For the casual user, the practical problem is that in the doc, in the 
sentence
"An Analyzer that filters LetterTokenizer with LowerCaseFilter."
the words "LetterTokenizer" and "LowerCaseFilter" are not themselves 
links to the docs of the corresponding classes.
So the casual user has no idea where in the package hierarchy to look 
for those.  StandardAnalyzer is a case in point.
Moving up the class hierarchy doesn't help much either, since one 
quickly ends up at Object.
Entering the URL 
"http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/"
yields a long list of things, among them the ones looked for, but that 
can hardly be considered user-friendly.

Now here is a brilliant idea: why not create a public Lucene site, 
where the docs would be indexed with... Lucene?
Or is there already such a thing?

André

