lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: AlphaNumeric analyzer/tokenizer
Date Fri, 16 Aug 2019 16:04:43 GMT
Hi,

The easiest way is to use PatternTokenizer as part of your analyzer. It uses a regular expression
to split text into tokens. Just use a regular expression that matches the Unicode ranges for letters
and digits.
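
For illustration only, here is a minimal sketch of such a regular expression, tried out with plain java.util.regex before wiring anything into Lucene. The class name, the method name and the sample string are made up for this example, and the exact pattern is an assumption: it treats every run of characters that are neither Unicode letters (\p{L}) nor decimal digits (\p{Nd}) as a separator, so underscores split tokens as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class AlphaNumericSplitDemo {
    // Assumed pattern: any run of characters that are neither Unicode letters
    // nor decimal digits counts as a separator (this includes '_', '-', spaces).
    static final Pattern SEPARATORS = Pattern.compile("[^\\p{L}\\p{Nd}]+");

    // Hypothetical helper: splits on the separators, drops empty pieces,
    // and lowercases each remaining token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : SEPARATORS.split(text)) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [axt1234, rev2, final, draft]
        System.out.println(tokenize("AXT1234_rev2 final-draft"));
    }
}
```

Note that "axt1234" stays one token here, unlike with SimpleAnalyzer, which drops the digits.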

To build your Analyzer, use the class CustomAnalyzer and its builder API to construct your
own analysis chain. Use PatternTokenizerFactory as the tokenizer, add filters like LowerCaseFilterFactory,
and you are done. No need for any new components in Lucene. It's all there, RTFM 😊

https://lucene.apache.org/core/8_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
https://lucene.apache.org/core/8_2_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizerFactory.html
(the example there is for Apache Solr, but you can use the same parameter names in CustomAnalyzer)
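
A rough sketch of what such a chain could look like with the CustomAnalyzer builder, assuming Lucene 8.x with the analyzers-common module on the classpath. The regular expression, the class name, the helper method and the field name "body" are assumptions chosen for this example, not something prescribed by Lucene. The tokenizer and filter are referenced by their SPI names ("pattern", "lowercase"), and the "pattern"/"group" parameters are the ones documented for PatternTokenizerFactory, where group=-1 means the regex is used as a separator.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AlphaNumericAnalyzerDemo {

    // Builds an analyzer that splits on anything that is not a letter or digit
    // and lowercases the resulting tokens. The regex is an assumption.
    static Analyzer build() throws IOException {
        return CustomAnalyzer.builder()
            .withTokenizer("pattern",
                "pattern", "[^\\p{L}\\p{Nd}]+",   // separator regex
                "group", "-1")                    // -1: split on matches
            .addTokenFilter("lowercase")
            .build();
    }

    // Hypothetical helper that collects the tokens the analyzer produces.
    static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        // "body" is just a placeholder field name for this demo.
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = build()) {
            System.out.println(analyze(analyzer, "AXT1234_rev2 final-draft"));
        }
    }
}
```

The same analyzer instance should be used at both index and query time so that the tokens match.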

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Abhishek Chauhan <abhishek.chauhan792@gmail.com>
> Sent: Friday, August 16, 2019 11:23 AM
> To: java-user@lucene.apache.org
> Subject: AlphaNumeric analyzer/tokenizer
> 
> Hi,
> 
> We have been using SimpleAnalyzer, which keeps only letters in its tokens.
> This prevents us from searching in strings that contain both letters and
> numbers, for example "axt1234". SimpleAnalyzer only enables us to search for
> "axt" successfully, but search strings like "axt1", "axt123" etc. give no
> results because the numbers were ignored during indexing.
> 
> I can use StandardAnalyzer or WhitespaceAnalyzer, but I also want to tokenize
> on underscores, which these analyzers don't do. I have also looked at
> WordDelimiterFilter, which will split "axt1234" into "axt" and "1234". However,
> even with this, I cannot search for "axt12" etc.
> 
> Is there something like an alphanumeric analyzer which would be very
> similar to SimpleAnalyzer but, in addition to letters, would also keep
> digits in its tokens? I am willing to contribute such an analyzer if one is
> not available.
> 
> Thanks and Regards,
> Abhishek



