lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject Re: WordDelimiterFilterFactory and StandardTokenizer
Date Fri, 16 May 2014 20:56:28 GMT
Hi Aiguofer,

Do you mean ClassicTokenizer? StandardTokenizer does not set token types such as e-mail and URL; ClassicTokenizer does.


I wouldn't go with the JFlex edit, mainly because of maintenance costs. A custom tokenizer will be a burden to maintain.

A MappingCharFilter can be used to manipulate the tokenizer's behavior.

For example, if you don't want your tokenizer to break on hyphens, map the hyphen to a character your tokenizer does not break on, such as an underscore:

"-" => "_"



Plus, WDF can be customized too. Please see the types attribute:

http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/wdftypes.txt
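
For instance, an entry in the types file can re-classify a character so WDF stops treating it as a delimiter (a sketch; allowable types include LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, and SUBWORD_DELIM):

  # treat hyphen and slash as letters so WDF does not split on them
  - => ALPHA
  / => ALPHA

referenced from the filter definition in your analyzer chain:

  <filter class="solr.WordDelimiterFilterFactory" types="wdftypes.txt"/>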

 
Ahmet


On Friday, May 16, 2014 6:24 PM, aiguofer <difernan@redhat.com> wrote:
Jack Krupansky wrote:

> Typically the white space tokenizer is the best choice when the word 
> delimiter filter will be used.
> 
> -- Jack Krupansky

If we wanted to keep the StandardTokenizer (because we make use of the token
types) but wanted to use the WDFF to get combinations of words that are
split on certain characters (mainly - and /, but possibly others as well),
what is the suggested way of accomplishing this? Would we just have to
extend the JFlex file for the tokenizer and re-compile it?



--
View this message in context: http://lucene.472066.n3.nabble.com/WordDelimiterFilterFactory-and-StandardTokenizer-tp4131628p4136146.html
Sent from the Solr - User mailing list archive at Nabble.com.

