lucene-solr-user mailing list archives

From Jeff Rose <>
Subject Re: shingles work in analyzer but not real data
Date Thu, 02 Sep 2010 13:05:58 GMT
On Wed, Sep 1, 2010 at 3:35 PM, Robert Muir <> wrote:

> On Wed, Sep 1, 2010 at 8:21 AM, Jeff Rose <> wrote:
> > Hi,
> >  We are using SOLR to match query strings with a keyword database, where
> > some of the keywords are actually more than one word.  For example a keyword
> > might be "apple pie" and we only want it to match for a query containing
> > that word pair, but not one only containing "apple".  Here is the relevant
> > piece of the schema.xml, defining the index and query pipelines:
> >
> >  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >     <analyzer type="index">
> >       <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> >       <filter class="solr.LowerCaseFilterFactory"/>
> >       <filter class="solr.TrimFilterFactory"/>
> >     </analyzer>
> >     <analyzer type="query">
> >       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >       <filter class="solr.LowerCaseFilterFactory"/>
> >       <filter class="solr.TrimFilterFactory"/>
> >       <filter class="solr.ShingleFilterFactory"/>
> >     </analyzer>
> >  </fieldType>
> >
> > In the analysis tool this schema looks like it works correctly.  Our
> > multi-word keywords are indexed as a single entry, and then when a search
> > phrase contains one of these multi-word keywords it is shingled and matched.
> >  Unfortunately, when we do the same queries on top of the actual index it
> > responds with zero matches.  I can see in the index histogram that the terms
> > are correctly indexed from our mysql datasource containing the keywords, but
> > somehow the shingling doesn't appear to work on this live data.  Does anyone
> > have experience with shingling that might have some tips for us, or
> > otherwise advice for debugging the issue?
> >
> Query-time shingling probably isn't working with the query parser you are
> using. The default Lucene one first splits on whitespace before sending the
> text to the analyzer: e.g. a query of foo bar is processed as
> TokenStream(foo) + TokenStream(bar), so the shingle filter never sees two
> adjacent tokens and query-time shingling like this doesn't work as you expect.

Hi Robert, thanks for the response.  I've looked into the query parsers a
bit and I did find that using the raw parser on a matching multi-word
keyword works correctly.  I still need shingling, though, in order to
support query phrases.  It seems odd for the query parser to be emitting
tokens.  If this is the case, why would we ever use the
WhitespaceTokenizer?  Either way, do you know what the correct configuration
should be to actually perform shingling as it is documented to work: joining
adjacent tokens into a single search term?  (e.g. "apple" "pie" should
become "apple pie")
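
To make the difference concrete, here is a rough sketch in plain Python (not Solr or Lucene code) of the two behaviours being discussed.  The function names are made up for illustration, and the shingle step assumes the filter's default settings as I understand them: shingles of up to two tokens, with the single tokens also emitted.

```python
def shingles(tokens, max_size=2, sep=" "):
    """Emit each token plus shingles of up to max_size adjacent tokens."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out


def analyze(text):
    """Hypothetical stand-in for the query analyzer chain in the schema:
    whitespace tokenizer -> lowercase -> trim -> shingle."""
    tokens = [t.lower().strip() for t in text.split()]
    return shingles(tokens)


def whole_string(query):
    """What the analysis tool appears to do: analyze the full query text."""
    return analyze(query)


def parser_split_first(query):
    """What Robert describes: the parser splits on whitespace first and
    runs the analyzer separately on each chunk."""
    terms = []
    for chunk in query.split():
        terms.extend(analyze(chunk))
    return terms


if __name__ == "__main__":
    print(whole_string("Apple Pie"))
    print(parser_split_first("Apple Pie"))
```

With the whole string analyzed at once, "apple pie" shows up as a term and can match the indexed keyword.  With the parser splitting first, each analyzer call only ever sees one token, so no two-word shingle can be produced.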

Thanks a lot for the help.


P.S. Markus, putting double quotes around the query doesn't seem to have any
effect.  It would be nice to have the analysis debug output on the actual
queries so that I could see what is being searched for after analysis...
