lucene-java-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: Custom indexing
Date Mon, 18 Apr 2016 20:27:21 GMT


Hi,

Please try LetterTokenizer; it should cover your example.
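
As a rough illustration of what letter-only splitting does to the file names from the question, here is a plain-Java sketch. It does not use Lucene; it only mimics the documented rule that LetterTokenizer divides text at non-letters (as judged by Character.isLetter). The class name LetterSplitSketch and the method tokens are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: imitates "split at non-letters" without any Lucene classes.
class LetterSplitSketch {

    // Collect maximal runs of letters; every other character acts as a separator.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                run.append(c);
            } else if (run.length() > 0) {
                out.add(run.toString());
                run.setLength(0);
            }
        }
        if (run.length() > 0) {
            out.add(run.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] names = {"lucene.txt", "lucene_new.txt", "lucene_1_new.txt"};
        for (String name : names) {
            System.out.println(name + " -> " + tokens(name));
        }
    }
}
```

With this rule 'lucene' matches all three names and 'new' matches the last two. Note that the digit in lucene_1_new.txt is treated as a separator and is not kept as a token, which matters if numbers must be searchable.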

Ahmet

On Monday, April 18, 2016 3:02 PM, PK C <tech.kumarpch@gmail.com> wrote:



Hi,

   Thank you very much for your quick responses.

Jack Krupansky,

The main use case is searching file names, for example lucene.txt,
lucene_new.txt, and lucene_1_new.txt. Searching for 'lucene' should return
all 3 files; searching for 'new' should return the last two. Please note that
the Standard analyzer/tokenizer of Lucene 3.6 is not giving us these results:
it does not split on "." and "_". Are you referring to versions later than 3.6?

Ahmet,

1. Not sure if LetterTokenizer helps with the above example, where file
names contain both numbers and letters.
2. WordDelimiterFilter does not seem to be in Lucene 3.6.
3. MappingCharFilter is what I am already using, by overriding the initReader
method in my CustomAnalyzer (source copied from StandardAnalyzer, which is a
final class). Is this a good way to reuse the final class StandardAnalyzer
with some custom changes, or is composition better?
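
To make the character-mapping idea in point 3 concrete, here is a minimal plain-Java sketch. It does not use Lucene or MappingCharFilter; it only shows the effect of rewriting "." and "_" to spaces before a whitespace split. The class name CharMappingSketch and its methods are hypothetical, and a real analyzer would apply its own tokenizer after the mapping step.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: map separator characters to spaces, then split on whitespace.
class CharMappingSketch {

    // Replace the two characters from the question with a space.
    static String mapSeparators(String text) {
        return text.replace('.', ' ').replace('_', ' ');
    }

    // Split the mapped text on runs of whitespace, skipping empty pieces.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : mapSeparators(text).trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] names = {"lucene.txt", "lucene_new.txt", "lucene_1_new.txt"};
        for (String name : names) {
            System.out.println(name + " -> " + tokens(name));
        }
    }
}
```

Unlike a letter-only split, this keeps the "1" in lucene_1_new.txt as its own token, so numeric parts of a file name remain searchable.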

Thank you again,
Best Regards


On Tue, Apr 12, 2016 at 8:45 PM, Jack Krupansky <jack.krupansky@gmail.com>
wrote:

> The standard analyzer/tokenizer should do a decent job of splitting on dot,
> hyphen, and underscore, in addition to whitespace and other punctuation.
>
> Can you post some specific test cases you are concerned with? (You should
> always run some test cases.)
>
> -- Jack Krupansky
>
> On Tue, Apr 12, 2016 at 10:35 AM, Ahmet Arslan <iorixxx@yahoo.com.invalid>
> wrote:
>
> > Hi Chamarty,
> >
> > Well, there are a lot of options here.
> >
> > 1) Use LetterTokenizer
> > 2) Use WordDelimiterFilter combined with WhitespaceTokenizer
> > 3) Use MappingCharFilter to replace those characters with spaces
> > .
> > .
> > .
> >
> > Ahmet
> >
> >
> > On Tuesday, April 12, 2016 3:58 PM, PrasannaKumar Chamarty <
> > tech.kumarpch@gmail.com> wrote:
> >
> >
> >
> > Hi,
> >
> > What is the best way (in terms of maintenance required with new lucene
> > releases) to allow splitting of words on "." and "_" for indexing ? Thank
> > you.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>


