lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: Preceding special characters in ClassicTokenizerFactory
Date Mon, 03 Oct 2016 20:16:56 GMT
Hi Andy,

WordDelimiterFilter has a "types" option. There is an example file named wdftypes.txt in the
source tree that shows the mapping format, which can be used to preserve #hashtags and
@mentions. If you follow this path, please use the Whitespace tokenizer.
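
A rough, untested sketch of what this could look like in the schema. The field type name "text_social" is hypothetical, and the sketch assumes a types file named wdftypes.txt has been placed in the core's conf directory with # and @ mapped to ALPHA; the filter flags shown are one possible choice, not a recommendation.

```xml
<!-- Hypothetical field type; the name "text_social" is an assumption. -->
<fieldType name="text_social" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so leading # and @ reach the filter intact. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--
      Assumed contents of conf/wdftypes.txt (lines starting with # are
      comments in that file, so the hash is written as a Unicode escape):
        \u0023 => ALPHA
        @ => ALPHA
    -->
    <filter class="solr.WordDelimiterFilterFactory"
            types="wdftypes.txt"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis screen in the Solr admin UI is a convenient place to check how a value such as "#solr @lucene" is actually tokenized by a chain like this.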

Ahmet



On Monday, October 3, 2016 9:52 PM, "Whelan, Andy" <awhelan@srcinc.com> wrote:
Hello,
I am guessing that what I am looking for is probably going to require extending StandardTokenizerFactory
or ClassicTokenizerFactory. But I thought I would ask the group here before attempting this.
We are indexing documents from an eclectic set of sources. There is, however, a heavy interest
in computing and social media sources. So computer terminology and social media terms (terms
beginning with hashes (#), @ symbols, etc.) are terms that we would like to have searchable.

We are considering the ClassicTokenizerFactory because we like the fact that it does not use
the Unicode Standard Annex UAX #29 (http://unicode.org/reports/tr29/#Word_Boundaries)
word boundary rules. It preserves email addresses, internet domain names, etc. We would also
like to use it as the tokenizer element of index and query analyzers that would preserve
@<rest of token> or #<rest of token> patterns.

I have seen examples where folks are replacing the StandardTokenizerFactory in their analyzer
with stream combinations made up of charFilters,  WhitespaceTokenizerFactory, etc. as in the
following article http://www.prowave.io/indexing-special-terms-using-solr/ to remedy such
problems.

Example:
         <analyzer type="index">
                 <!-- Replace a period followed by whitespace with a single space -->
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\.\s)" replacement=" " />
                 <!-- Drop a period at the end of the input -->
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\.$)" replacement="" />
                 <!-- Replace commas, semicolons, pipes and slashes with spaces -->
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(,)" replacement=" " />
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(;)" replacement=" " />
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\|)" replacement=" " />
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\/)" replacement=" " />
                 <!-- Split on whitespace only, so leading # and @ are kept -->
                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                 <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true" expand="false"/>
                 <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
                 <filter class="solr.LowerCaseFilterFactory"/>
         </analyzer>


I am just wondering whether anyone knows of a smart way (without extending classes) to
preserve most of the ClassicTokenizerFactory functionality while keeping leading
special characters. The "Solr in Action" book (page 179) claims that it is hard to extend
the StandardTokenizerFactory. I'm assuming the same is true for ClassicTokenizerFactory.

Thanks
-Andrew
