lucene-solr-user mailing list archives

From "Whelan, Andy" <awhe...@srcinc.com>
Subject Preceding special characters in ClassicTokenizerFactory
Date Mon, 03 Oct 2016 18:51:51 GMT
Hello,
I suspect that what I am looking for will require extending StandardTokenizerFactory
or ClassicTokenizerFactory, but I thought I would ask the group here before attempting this.
We are indexing documents from an eclectic set of sources, with a heavy emphasis on
computing and social media sources. We would therefore like computer terminology and
social media terms (those beginning with a hash (#), an @ symbol, etc.) to be searchable.

We are considering the ClassicTokenizerFactory because we like the fact that it does not use
the Unicode Standard Annex UAX#29 (http://unicode.org/reports/tr29/#Word_Boundaries)
word boundary rules. It preserves email addresses, internet domain names, etc. We would also
like to use it as the tokenizer element of index and query analyzers that preserve
@<rest of token> or #<rest of token> patterns.

I have seen examples where people work around such problems by replacing the
StandardTokenizerFactory in their analyzer with a chain of charFilters,
WhitespaceTokenizerFactory, etc., as in the following article:
http://www.prowave.io/indexing-special-terms-using-solr/

Example:
         <analyzer type="index">
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\.\s)" replacement=" " />
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\.$)" replacement="" />
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(,)" replacement=" " />
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(;)" replacement=" " />
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\|)" replacement=" " />
                 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\/)" replacement=" " />
                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                 <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true" expand="false"/>
                 <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
                 <filter class="solr.LowerCaseFilterFactory"/>
         </analyzer>


Does anyone know of a smart way (without extending classes) to keep most of the
ClassicTokenizerFactory functionality while preserving leading special characters?
The "Solr in Action" book (page 179) claims that it is hard to extend the
StandardTokenizerFactory, and I assume the same holds for ClassicTokenizerFactory.
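
To make the question concrete, here is an untested sketch of the kind of class-free approach
meant here: a charFilter rewrites a leading # or @ into a letters-only placeholder before
ClassicTokenizerFactory runs, and a token filter restores the symbol afterwards. The
placeholder strings (zzhashzz, zzatzz) are hypothetical and would collide with any real term
starting with the same letters; how tokens such as #c++ or @user.name behave is an open
assumption that would need checking in the Solr analysis screen.

```xml
<analyzer type="index">
  <!-- Assumption: a '#' or '@' counts as "leading" when it starts the input or follows whitespace. -->
  <!-- Swap the symbol for a hypothetical letters-only placeholder so the tokenizer keeps it attached. -->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(^|\s)#(\w)" replacement="$1zzhashzz$2"/>
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(^|\s)@(\w)" replacement="$1zzatzz$2"/>
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <!-- Lowercase first so the placeholder patterns below only need to match one casing. -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Restore the original symbol at the start of each affected token. -->
  <filter class="solr.PatternReplaceFilterFactory" pattern="^zzhashzz" replacement="#" replace="first"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="^zzatzz" replacement="@" replace="first"/>
</analyzer>
```

The same chain would presumably have to be repeated in the query analyzer so that a search
for #solr or @solr produces the same token as the index side.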

Thanks
-Andrew

