lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2
Date Mon, 10 Nov 2014 14:05:31 GMT
Hi,

In general, you cannot change the bundled Analyzers; they are "examples" and can be seen as
"best practice". If you want different behaviour, write your own Analyzer subclass that uses
the Tokenizer and TokenFilters you want. For example, you can copy the source code of the
original and remove LowerCaseFilter. Analyzers are very simple: there is no logic in them,
just "configuration" (which Tokenizer and which TokenFilters). In later Lucene 3 releases and
in Lucene 4 this is even simpler: you just override createComponents (in ReusableAnalyzerBase
in Lucene 3.1+, in Analyzer itself in Lucene 4) and put your "configuration" there.
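
For Lucene 3.0.x, a minimal sketch of such a subclass might look like the following. It is
not the SnowballAnalyzer source; it only assumes that the original chain is StandardTokenizer,
StandardFilter, LowerCaseFilter, StopFilter, SnowballFilter, so please compare it with the
source of your exact version. The class name CasePreservingSnowballAnalyzer is made up for
illustration. Two assumptions to check: stop words are matched case-insensitively here
(ignoreCase = true), and the Snowball stemmers are written for lowercase input, so
mixed-case tokens may be stemmed differently or not at all.

```java
import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

/** Hypothetical analyzer: a Snowball-style chain without LowerCaseFilter. */
final class CasePreservingSnowballAnalyzer extends Analyzer {

    private final String stemmerName; // e.g. "English"
    private final Set<?> stopWords;   // may be null to skip stop word removal

    CasePreservingSnowballAnalyzer(String stemmerName, Set<?> stopWords) {
        this.stemmerName = stemmerName;
        this.stopWords = stopWords;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
        result = new StandardFilter(result);
        // LowerCaseFilter is deliberately left out so that case is preserved.
        if (stopWords != null) {
            // Arguments: enablePositionIncrements = true, ignoreCase = true,
            // so "The" is still removed when the stop set contains "the".
            result = new StopFilter(true, result, stopWords, true);
        }
        result = new SnowballFilter(result, stemmerName);
        return result;
    }
}
```

An instance of this class can then be passed to ShingleAnalyzerWrapper in place of the
SnowballAnalyzer.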

If you use Apache Solr or Elasticsearch, you can define your analyzers through XML (Solr) or JSON (Elasticsearch) configuration.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Martin O'Shea [mailto:m.oshea@dsl.pipex.com]
> Sent: Monday, November 10, 2014 2:54 PM
> To: java-user@lucene.apache.org
> Subject: How to disable LowerCaseFilter when using SnowballAnalyzer in
> Lucene 3.0.2
> 
> I realise that 3.0.2 is an old version of Lucene but if I have Java code as
> follows:
> 
> 
> 
> int nGramLength = 3;
> Set<String> stopWords = new HashSet<String>();
> stopWords.add("the");
> stopWords.add("and");
> ...
> SnowballAnalyzer snowballAnalyzer =
>     new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
> ShingleAnalyzerWrapper shingleAnalyzer =
>     new ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> 
> 
> 
> This generates the frequency of ngrams from a particular string of text
> with stop words removed. How can I disable the LowerCaseFilter which forms
> part of the SnowballAnalyzer? I want to preserve the case of the ngrams
> generated so that I can perform various counts according to the presence /
> absence of upper case characters in the ngrams.
> 
> 
> 
> I am something of a Lucene newbie. And I should add that upgrading the
> version of Lucene is not an option here.


