lucene-java-user mailing list archives

From "Uwe Schindler" <>
Subject RE: Custom indexing
Date Tue, 19 Apr 2016 08:05:55 GMT
> The main use case is searching in file names. For example, lucene.txt,
> lucene_new.txt, lucene_1_new.txt. If I use 'lucene', I need to get all 3
> files. With 'new' I need to get the last two files. Please note that the Standard
> analyzer/tokenizer of Lucene 3.6 is not giving us the results with
> tokenization of "." and "_". Are you referring to later versions than 3.6?

Hi,

StandardTokenizer in 3.6 is the old, non-Unicode-compliant classic tokenizer.
In Lucene 4+ it is called "ClassicTokenizer" because it is still used by some users, but newer
code should use the new StandardTokenizer. From Lucene 4 on, StandardTokenizer implements
the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode
Standard Annex #29.

The new StandardTokenizer is not available in such old Lucene versions, sorry. Your only options
are to use LetterTokenizer or to write your own tokenizer.
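
To give a rough idea of what letter-based tokenization would do with the file names from the question, here is a minimal sketch in plain Java. It does not use the Lucene API; the class LetterRunSplitter and the method letterRuns are hypothetical names made up for this illustration. It only mimics the rule that LetterTokenizer follows, namely that tokens are maximal runs of letters and everything else is a separator. Note that under this rule digits are dropped too, so the "1" in lucene_1_new.txt would not become a token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: imitates letter-run tokenization.
public class LetterRunSplitter {

    // Splits text into maximal runs of letters. Every non-letter character
    // (".", "_", digits, ...) acts as a separator and is discarded.
    // Case is preserved; no lowercasing is applied here.
    static List<String> letterRuns(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A separator ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        // Flush the last token if the text ended inside a letter run.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] names = {"lucene.txt", "lucene_new.txt", "lucene_1_new.txt"};
        for (String name : names) {
            System.out.println(name + " -> " + letterRuns(name));
        }
    }
}
```

With tokens like these, a query for 'lucene' would match all three names and a query for 'new' would match the last two. If you need the digits as tokens as well, that is a case for writing your own tokenizer.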

