lucene-java-user mailing list archives

From "Martin O'Shea" <app...@dsl.pipex.com>
Subject RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2
Date Mon, 10 Nov 2014 14:17:06 GMT
Uwe

Thanks for the reply. Given that SnowballAnalyzer is made up of a series of filters, I was
thinking of something like this, where I 'pipe' the output of one filter into the next:

standardTokenizer = new StandardTokenizer(...);
standardFilter = new StandardFilter(standardTokenizer,...);
stopFilter = new StopFilter(standardFilter,...);
snowballFilter = new SnowballFilter(stopFilter,...);

That is, the same chain but without LowerCaseFilter. Does this make sense?

Thanks

Martin O'Shea.
-----Original Message-----
From: Uwe Schindler [mailto:uwe@thetaphi.de] 
Sent: 10 Nov 2014 14:06
To: java-user@lucene.apache.org
Subject: RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

Hi,

In general, you cannot change Analyzers; they are "examples" and can be seen as "best practice".
If you want to modify one, write your own Analyzer subclass that uses whichever Tokenizers
and TokenFilters you want. You can, for example, copy the source code of the original and
remove LowerCaseFilter. Analyzers are very simple: there is no logic in them, just some
"configuration" (which Tokenizer and which TokenFilters). In later Lucene 3 releases and in
Lucene 4, this is very simple: you just need to override createComponents in the Analyzer
class and add your "configuration" there.
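
For Lucene 3.0.x, where (as far as I know) createComponents is not yet available, the same idea
can be expressed by overriding tokenStream instead. The following is only a sketch under that
assumption, not the actual SnowballAnalyzer source: the class name CasePreservingSnowballAnalyzer
is made up for illustration, and the constructor signatures are the ones I believe the 3.0 core
and the snowball contrib module provide, so check them against the 3.0.2 javadocs.

```java
import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical class: the usual Snowball chain with LowerCaseFilter left out.
final class CasePreservingSnowballAnalyzer extends Analyzer {
    private final Version matchVersion;
    private final String stemmerName; // e.g. "English"
    private final Set<?> stopWords;   // may be null for "no stop words"

    CasePreservingSnowballAnalyzer(Version matchVersion, String stemmerName, Set<?> stopWords) {
        this.matchVersion = matchVersion;
        this.stemmerName = stemmerName;
        this.stopWords = stopWords;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(matchVersion, reader);
        result = new StandardFilter(result);
        // No LowerCaseFilter here, so tokens keep their original case.
        if (stopWords != null) {
            // ignoreCase = true so that "The" is still removed by the stop word "the".
            result = new StopFilter(
                    StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
                    result, stopWords, true);
        }
        result = new SnowballFilter(result, stemmerName);
        return result;
    }
}
```

An instance of this class could then be passed to ShingleAnalyzerWrapper in place of the
SnowballAnalyzer. One caveat: Snowball stemmers are generally written for lower-case input, so
capitalised or all-caps tokens may be stemmed differently from their lower-case forms, or not
stemmed at all.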

If you use Apache Solr or Elasticsearch, you can create your analyzers through XML or JSON configuration.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Martin O'Shea [mailto:m.oshea@dsl.pipex.com]
> Sent: Monday, November 10, 2014 2:54 PM
> To: java-user@lucene.apache.org
> Subject: How to disable LowerCaseFilter when using SnowballAnalyzer in 
> Lucene 3.0.2
> 
> I realise that 3.0.2 is an old version of Lucene but if I have Java 
> code as
> follows:
> 
> 
> 
> int nGramLength = 3;
>
> // Set is an interface, so a concrete java.util.HashSet is instantiated here.
> Set<String> stopWords = new HashSet<String>();
> stopWords.add("the");
> stopWords.add("and");
> ...
>
> SnowballAnalyzer snowballAnalyzer =
>     new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
>
> ShingleAnalyzerWrapper shingleAnalyzer =
>     new ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> 
> 
> 
> which generates the frequencies of n-grams from a given string of
> text, without stop words, how can I disable the LowerCaseFilter
> that forms part of SnowballAnalyzer? I want to preserve the case
> of the n-grams generated so that I can perform various counts according
> to the presence / absence of upper-case characters in the n-grams.
> 
> 
> 
> I am something of a Lucene newbie. And I should add that upgrading the 
> version of Lucene is not an option here.
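
To make the counting goal in the question concrete, here is a small plain-Java sketch that is
independent of Lucene. It assumes the case-preserved tokens have already been produced by some
analysis chain; the class CaseAwareNgramCounter and its methods are hypothetical helpers, and
unlike ShingleAnalyzerWrapper it only builds n-grams of one fixed size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class CaseAwareNgramCounter {

    /** Builds word n-grams by joining each run of n consecutive tokens with a space. */
    static List<String> ngrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> result = new ArrayList<String>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    /** Returns true if at least one character of the n-gram is upper case. */
    static boolean hasUpperCase(String ngram) {
        for (int i = 0; i < ngram.length(); i++) {
            if (Character.isUpperCase(ngram.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    /** Counts occurrences of each distinct n-gram, keeping first-seen order. */
    static Map<String, Integer> frequencies(List<String> ngrams) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String gram : ngrams) {
            Integer old = counts.get(gram);
            counts.put(gram, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Lucene", "is", "a", "search", "library");
        for (Map.Entry<String, Integer> e : frequencies(ngrams(tokens, 3)).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue()
                    + " (upper case present: " + hasUpperCase(e.getKey()) + ")");
        }
    }
}
```

Because the keys keep their original case, "Lucene is a" and "lucene is a" would be counted as
two different n-grams, which is the distinction a LowerCaseFilter would otherwise erase.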



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



