lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: SolrUser - Reindex
Date Thu, 13 May 2010 23:17:17 GMT
In general, it's hard to give a single answer, since there are many
factors to consider, not the least of which is what you
want it to do. In this case, I suspect the issue is
WordDelimiterFilterFactory, which splits tokens on all
non-alphanumeric characters by default, so the '@' is treated
as a delimiter and dropped.
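
If the goal is to keep whole addresses such as xxx@gmail.com searchable,
one option is a field type that never runs WordDelimiterFilterFactory.
The fragment below is only a sketch: the type name "email_exact" is made
up for illustration, and it assumes the to_email field from your query
holds a single address per value. Index-time analysis changes only take
effect after reindexing, and leading-wildcard queries such as *@* have
their own constraints that this does not address.

```xml
<!-- Hypothetical field type; the name "email_exact" is illustrative only -->
<fieldType name="email_exact" class="solr.TextField">
  <analyzer>
    <!-- Emit the whole field value as one token, so '@' and '.' survive -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercase so matching is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Assumes the to_email field named in the query from this thread -->
<field name="to_email" type="email_exact" indexed="true" stored="true"/>
```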

It would probably be a good idea to work with
the various combinations of tokenizers and filters
to get a feel for what they do.

The admin analysis page allows you to put in arbitrary
text and see what the results of analysis are. So if you
define a bunch of different fields in your schema (just for
testing), and then put text in the analysis page you'll
see what transformations occur. This is invaluable for
understanding the differences. And until you get a good
idea what various tokenizers and filters do both in isolation
and in combination, you'll get lots of surprises. Even after
you're familiar with them, you'll *still* get surprises, but at
least you'll have a chance to figure it out...
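
As a rough sketch of that kind of experiment, two throwaway field types
like the ones below could be added to the schema. The names "test_ws" and
"test_ws_wdf" are hypothetical; the only difference between them is the
WordDelimiterFilterFactory step. Running xxx@gmail.com through each one
on the analysis page should show at which stage the '@' disappears. I'd
expect the first to keep the address as a single token and the second to
break it into parts.

```xml
<!-- Hypothetical test-only type: whitespace tokenization, no filters -->
<fieldType name="test_ws" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical test-only type: same tokenizer plus the word delimiter filter -->
<fieldType name="test_ws_wdf" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"/>
  </analyzer>
</fieldType>
```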

Best
Erick


On Thu, May 13, 2010 at 5:23 PM, Anderson vasconcelos <
anderson.vass@gmail.com> wrote:

> I'm using the textgen fieldtype on my field as follows:
> <fieldType name="textgen" class="solr.TextField"
> positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldType>
>
> .....
>  <dynamicField name="field_value_*"  type="textgen"    indexed="true"
> stored="true"/>
>
> .....
>
> These do not remove the @ symbol. To index the @ symbol, must I use
> HTMLStripStandardTokenizerFactory?
>
> Thanks
>
> 2010/5/13 Erick Erickson <erickerickson@gmail.com>
>
> > Your analyzer is probably removing the @ symbol; it's hard to say if you
> > don't include the relevant parts of your schema.
> >
> > This page might help:
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
> >
> > Best
> > Erick
> >
> > On Thu, May 13, 2010 at 3:59 PM, Anderson vasconcelos <
> > anderson.vass@gmail.com> wrote:
> >
> > > Why doesn't Solr/Lucene index the character '@'?
> > >
> > > I index email fields such as xxx@gmail.com, and afterwards a search for
> > > to_email:*@* finds nothing.
> > >
> > > Do I need to change some configuration?
> > >
> > > Thanks
> > >
> >
>
