lucene-solr-user mailing list archives

From Michael _ <>
Subject Re: Preserving "C++" and other weird tokens
Date Fri, 07 Aug 2009 15:10:19 GMT
On Thu, Aug 6, 2009 at 11:38 AM, Michael _ <> wrote:

> Hi everyone,
> I'm indexing several documents that contain words that the
> StandardTokenizer cannot detect as tokens.  These are words like
>   C#
>   .NET
>   C++
> which are important for users to be able to search for, but get treated as
> "C", "NET", and "C".
> How can I create a list of words that should be understood to be
> indivisible tokens?  Is my only option somehow stringing together a lot of
> PatternTokenizers?  I'd love to do something like <tokenizer
> class="StandardTokenizer" tokenwhitelist=".NET C++ C#" />.
> Thanks in advance!

By the way, in case it wasn't clear: I'm not particularly tied to using the
StandardTokenizer.  Any tokenizer would be fine, if it did a reasonable job
of splitting up the input text while preserving special cases.

I'm also not averse to passing in a list of regexes if I had to, but I suspect
that would redo a lot of the work already done by the parser inside the
Tokenizer.

