lucene-solr-user mailing list archives

From Diego Fernandez <difer...@redhat.com>
Subject Re: WordDelimiterFilterFactory and StandardTokenizer
Date Tue, 20 May 2014 14:49:48 GMT
Great, thanks for the information!  Right now we're using the StandardTokenizer types to filter
out CJK characters with a custom filter.  I'll test using MappingCharFilters, although I'm
a little concerned about possible adverse side effects.

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics


----- Original Message -----
> Hi Aiguofer,
> 
> You mean ClassicTokenizer? Because StandardTokenizer does not set token types
> (e-mail, url, etc).
> 
> 
> I wouldn't go with the JFlex edit, mainly because of maintenance costs. It will
> be a burden to maintain a custom tokenizer.
> 
> MappingCharFilters could be used to manipulate tokenizer behavior.
> 
> As an example, if you don't want your tokenizer to break on hyphens,
> replace them with a character that your tokenizer does not break on, for
> example an underscore.
> 
> "-" => "_"
> 
> 
> 
> Plus, WDF can be customized too. Please see the types attribute:
> 
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/wdftypes.txt
> 
>  
> Ahmet
> 
> 
> On Friday, May 16, 2014 6:24 PM, aiguofer <difernan@redhat.com> wrote:
> Jack Krupansky-2 wrote
> 
> > Typically the white space tokenizer is the best choice when the word
> > delimiter filter will be used.
> > 
> > -- Jack Krupansky
> 
> If we wanted to keep the StandardTokenizer (because we make use of the token
> types) but wanted to use the WDFF to get combinations of words that are
> split with certain characters (mainly - and /, but possibly others as well),
> what is the suggested way of accomplishing this? Would we just have to
> extend the JFlex file for the tokenizer and re-compile it?
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/WordDelimiterFilterFactory-and-StandardTokenizer-tp4131628p4136146.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 

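The MappingCharFilter suggestion quoted above might be wired into a Solr schema roughly as follows. This is only a sketch: the field type name text_wdf and the mapping file name mapping-hyphen.txt are hypothetical, and the WordDelimiterFilterFactory flags are one plausible combination rather than anything stated in the thread.

```xml
<!-- Hypothetical field type; names and flag choices are assumptions, not from the thread. -->
<fieldType name="text_wdf" class="solr.TextField">
  <analyzer>
    <!-- mapping-hyphen.txt (hypothetical file) would hold the rule from the thread:
         "-" => "_"
         so the tokenizer sees an underscore where the input had a hyphen. -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-hyphen.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- WDF then splits on the underscore and can also emit the joined form. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

Because the char filter runs before the tokenizer, any token preserved by preserveOriginal would contain the underscore rather than the original hyphen, which is the kind of side effect worth checking before relying on this setup. A similar rule for "/" could be added to the same mapping file.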