lucene-java-user mailing list archives

From "Martin O'Shea" <app...@dsl.pipex.com>
Subject RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2
Date Tue, 11 Nov 2014 18:35:45 GMT
Ahmet, 

Yes, that is quite true. But as this is only a proof-of-concept application,
I'm prepared for things to be 'imperfect'.

Martin O'Shea.

-----Original Message-----
From: Ahmet Arslan [mailto:iorixxx@yahoo.com.INVALID] 
Sent: 11 Nov 2014 18:26
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

With that analyzer, searches for the same word with different capitalisation
could return different results.

Ahmet
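
To illustrate the point above, here is a minimal plain-Java sketch (not Lucene
code; the class and method names are hypothetical) showing that, without
lowercasing, case variants of one word become separate terms:

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a toy term counter, not a Lucene index.
class CaseSensitiveTermsSketch {

    // Counts terms; when lowercase is false, "Lucene" and "lucene" are distinct keys.
    static Map<String, Integer> termCounts(String[] tokens, boolean lowercase) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String token : tokens) {
            String term = lowercase ? token.toLowerCase(Locale.ROOT) : token;
            Integer old = counts.get(term);
            counts.put(term, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = {"Lucene", "lucene", "LUCENE"};
        System.out.println(termCounts(tokens, false)); // three separate terms
        System.out.println(termCounts(tokens, true));  // one term, counted three times
    }
}
```

A query for one variant would only match documents containing that exact
variant, which is why results can differ by capitalisation.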


On Tuesday, November 11, 2014 6:57 PM, Martin O'Shea <appy74@dsl.pipex.com>
wrote:
In the end I edited the code of the StandardAnalyzer and the
SnowballAnalyzer to disable the calls to the LowerCaseFilter. This seems to
work.
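
For the counting step described in the original question (counts according to
the presence or absence of upper case characters in the ngrams), a minimal
plain-Java sketch might look like the following. It is independent of Lucene,
all names are hypothetical, and it assumes the tokens have already been
produced with their case preserved. As a simplification it builds n-grams of
exactly n tokens, which may differ from what ShingleAnalyzerWrapper emits
depending on its settings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: builds word n-grams and counts those containing upper case.
class ShingleCaseCountSketch {

    // Builds word n-grams of exactly n tokens, joined by single spaces.
    static List<String> shingles(List<String> tokens, int n) {
        List<String> result = new ArrayList<String>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < i + n; j++) {
                if (j > i) {
                    sb.append(' ');
                }
                sb.append(tokens.get(j));
            }
            result.add(sb.toString());
        }
        return result;
    }

    // True if the string contains at least one upper case character.
    static boolean hasUpperCase(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isUpperCase(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Counts the n-grams that contain at least one upper case character.
    static int countWithUpperCase(List<String> ngrams) {
        int count = 0;
        for (String ngram : ngrams) {
            if (hasUpperCase(ngram)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("big", "New", "York", "city", "council", "votes");
        List<String> grams = shingles(tokens, 3);
        System.out.println(grams);
        System.out.println(countWithUpperCase(grams) + " of " + grams.size());
    }
}
```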

-----Original Message-----
From: Ahmet Arslan [mailto:iorixxx@yahoo.com.INVALID]
Sent: 10 Nov 2014 15:19
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

Regarding Uwe's warning, the SnowballFilter Javadoc states:

"NOTE: SnowballFilter expects lowercased text." [1]

[1]
https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/snowball/SnowballFilter.html



On Monday, November 10, 2014 4:43 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
Hi,

> Uwe
> 
> Thanks for the reply. Given that SnowballAnalyzer is made up of a
> series of filters, I was thinking about something like this, where I
> 'pipe' output from one filter to the next:
>
> standardTokenizer = new StandardTokenizer(...);
> standardFilter = new StandardFilter(standardTokenizer, ...);
> stopFilter = new StopFilter(standardFilter, ...);
> snowballFilter = new SnowballFilter(stopFilter, ...);
>
> But leave out LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from the Lucene source package)
in your own package and remove LowerCaseFilter. But be aware that Snowball
may need lowercased terms to stem correctly. I don't know this filter well;
I just want to make you aware.

The same applies to StopFilter, but that one lets you handle it: make the
stop filter case-insensitive (there is a boolean for this):

StopFilter(boolean enablePositionIncrements, TokenStream input,
           Set<?> stopWords, boolean ignoreCase)

Uwe
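
As an illustration of what case-insensitive stop word matching means for a
case-preserving pipeline, here is a minimal plain-Java sketch. It is not
Lucene's StopFilter and makes no claim about how that class is implemented;
the class and method names are hypothetical. Stop words match regardless of
case, while the tokens that survive keep their original capitalisation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: this is not Lucene's StopFilter.
class CasePreservingStopSketch {

    // Stop words are stored lowercased so that lookups can ignore case.
    private final Set<String> stopWords = new HashSet<String>();

    CasePreservingStopSketch(String... words) {
        for (String w : words) {
            stopWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    // Drops stop words regardless of case; kept tokens retain their original case.
    List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        CasePreservingStopSketch stop = new CasePreservingStopSketch("the", "and");
        // "The", "and" and "THE" are removed; "Quick" and "Fox" keep their case.
        System.out.println(stop.filter(Arrays.asList("The", "Quick", "and", "THE", "Fox")));
    }
}
```

Note that this only addresses stop words. Whether the Snowball stemmer behaves
correctly on mixed-case input is a separate question, as noted above.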

> Martin O'Shea.
> -----Original Message-----
> From: Uwe Schindler [mailto:uwe@thetaphi.de]
> Sent: 10 Nov 2014 14:06
> To: java-user@lucene.apache.org
> Subject: RE: How to disable LowerCaseFilter when using 
> SnowballAnalyzer in Lucene 3.0.2
> 
> Hi,
> 
> In general, you cannot change Analyzers; they are "examples" and can
> be seen as "best practice". If you want to modify them, write your own
> Analyzer subclass which uses the Tokenizers and TokenFilters you want.
> You can, for example, clone the source code of the original
> and remove LowerCaseFilter. Analyzers are very simple; there is no
> logic in them, just some "configuration" (which Tokenizer and
> which TokenFilters). In later Lucene 3 and Lucene 4, this is very
> simple: you just need to override createComponents in the Analyzer class
> and add your "configuration" there.
> 
> If you use Apache Solr or Elasticsearch you can create your analyzers 
> by XML or JSON configuration.
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> 
> > -----Original Message-----
> > From: Martin O'Shea [mailto:m.oshea@dsl.pipex.com]
> > Sent: Monday, November 10, 2014 2:54 PM
> > To: java-user@lucene.apache.org
> > Subject: How to disable LowerCaseFilter when using SnowballAnalyzer 
> > in Lucene 3.0.2
> >
> > I realise that 3.0.2 is an old version of Lucene but if I have Java 
> > code as
> > follows:
> >
> >
> >
> > int nGramLength = 3;
> >
> > // Set is an interface, so instantiate HashSet (both in java.util)
> > Set<String> stopWords = new HashSet<String>();
> > stopWords.add("the");
> > stopWords.add("and");
> > ...
> >
> > SnowballAnalyzer snowballAnalyzer =
> >     new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
> >
> > ShingleAnalyzerWrapper shingleAnalyzer =
> >     new ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> >
> >
> >
> > Which will generate the frequency of ngrams from a particular
> > string of text without stop words, how can I disable the
> > LowerCaseFilter which forms part of the SnowballAnalyzer? I want to
> > preserve the case of the ngrams generated so that I can perform
> > various counts according to the presence / absence of upper case
> > characters in the ngrams.
> >
> >
> >
> > I am something of a Lucene newbie. And I should add that upgrading 
> > the version of Lucene is not an option here.
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org





