lucene-solr-user mailing list archives

From Mark Bennett <mbenn...@ideaeng.com>
Subject Re: Trouble with Shingle filter and query parsing / expansion
Date Tue, 11 Aug 2009 21:16:23 GMT
One other idea I tried, which didn't work, was to see if I could get proper
parsing by passing the text to the MoreLikeThis handler via the stream.body
parameter:

http://localhost:8983/solr/mlt?stream.body=hello+world&mlt.fl=shingle_field&mlt.mintf=0&debugQuery=true
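
For anyone reproducing this from code rather than a browser, here is a minimal sketch of building that same request URL with the parameter values encoded. The class and method names (MltUrlSketch, mltUrl) are made up for illustration and are not part of Solr or SolrJ; only java.net.URLEncoder is a real API here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of Solr or SolrJ.
class MltUrlSketch {

    // Builds the /mlt request used above, form-encoding the free-text values
    // (spaces become '+', quotes become %22).
    static String mltUrl(String base, String body, String field)
            throws UnsupportedEncodingException {
        return base + "/mlt"
                + "?stream.body=" + URLEncoder.encode(body, "UTF-8")
                + "&mlt.fl=" + URLEncoder.encode(field, "UTF-8")
                + "&mlt.mintf=0&debugQuery=true";
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(
                mltUrl("http://localhost:8983/solr", "hello world", "shingle_field"));
    }
}
```

Encoding matters mostly for the quoted-phrase variants, where the double quotes have to survive the trip to the server.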


On Tue, Aug 11, 2009 at 9:09 AM, Mark Bennett <mbennett@ideaeng.com> wrote:

> I've got an index building with the shingle filter and I can see the
> compound terms with Luke, etc.  So far so good.  One detail, I did tell it
> to not emit unigrams - I've got single words covered in a normal field.
>
> And a bit of poking around the other day explained why shingle queries
> weren't working with the dismax handler in 1.4, also fine, I believe I
> understand now.
>
> But switching to the standard query handler, I still don't get proper
> multi-word shingle handling in any query, either via the web interface or
> via the various Java calls.  I'm guessing it has to do with the order tokens
> are parsed in, but if so I'm not sure what the workaround is.
>
> Some things I've tried:
>
> Standard Solr query:
> ...&q=shingle_field:hello+world&debugQuery=true
>
> Standard Solr query, with the default field set to the shingle field:
> ...&q=hello+world&debugQuery=true
>
> Standard Solr query as a quoted phrase, with the default field set to the
> shingle field:
> ...&q="hello+world"&debugQuery=true
>
> I switched over to Java.  Regular queries worked pretty easily, I could
> print them out.  But attempts to conjure a shingle query always produce
> nothing.
>
> // fieldName = shingle field
> SolrQueryParser qp = new SolrQueryParser( schema, fieldName );
> Query q = qp.parse( "hello world" );
> System.out.println( "Query Object = " + q );
>
> SolrQuery q = new SolrQuery();
> q.addField( fieldName );  // Just setting a return field I think....
> q.setQuery( "hello world" );
> System.out.println( "Query Object = " + q );
>
> // And I figured this one wouldn't work:
> SolrQueryParser qp = new SolrPluginUtils.DisjunctionMaxQueryParser(
>                      schema, fieldName );
> Query q = qp.parse( "hello world" );
> System.out.println( "Query Object = " + q );
>
> Looking at the constructors for
> org.apache.lucene.analysis.shingle.ShingleFilter they all seem to want a
> token stream, vs. a string.  But I think the default query entry points into
> Solr are what's getting me to the single token at a time problem.
>
> I did verify that it's finding my schema, and if I put a non-existent field
> name in there, it certainly notices.    I've tried with and without the
> PositionFilterFactory filter.  If I comment out the shingle stage everything
> works.
>
>     <fieldType name="text_shingle" class="solr.TextField"
>                positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="stopwords.txt"
>                 enablePositionIncrements="false"
>                 />
>         <filter class="solr.WordDelimiterFilterFactory"
>                 generateWordParts="0"
>                 generateNumberParts="0"
>                 catenateWords="1"
>                 catenateNumbers="1"
>                 catenateAll="0"
>                 splitOnCaseChange="0"
>                 stemEnglishPossessive="0"
>                 preserveOriginal="0"
>         />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.SnowballPorterFilterFactory" language="English"
>                 protected="protwords.txt"/>
>         <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>                 outputUnigrams="false"/>
>         <filter class="solr.PositionFilterFactory" />
>       </analyzer>
>     </fieldType>
>
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
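
The "single token at a time" suspicion in the quoted message can be illustrated with a small self-contained sketch. It does not use Lucene or Solr at all: ShingleSketch and its methods are hypothetical stand-ins, and it assumes (rather than proves) that the query parser splits the input on whitespace and runs the analyzer on each word separately. It also ignores the stopword, word-delimiter, lowercase and stemming stages of the field type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a bigram shingle stage; NOT the Lucene ShingleFilter API.
class ShingleSketch {

    // Joins each adjacent token pair with a space and emits no single tokens,
    // mirroring maxShingleSize="2" outputUnigrams="false".
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    // Assumed query-time behaviour: split on whitespace first, then analyze
    // each word on its own, so the shingle stage never sees a neighbour.
    static List<String> analyzePerWord(String query) {
        List<String> out = new ArrayList<>();
        for (String word : query.trim().split("\\s+")) {
            out.addAll(bigrams(Arrays.asList(word)));
        }
        return out;
    }

    // Index-time style behaviour: the whole text goes through the chain at once.
    static List<String> analyzeWhole(String text) {
        return bigrams(Arrays.asList(text.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        System.out.println("per word: " + analyzePerWord("hello world"));
        System.out.println("whole:    " + analyzeWhole("hello world"));
    }
}
```

Under those assumptions the per-word path produces no terms at all, while the whole-text path produces the single shingle "hello world", which would be consistent with the empty queries described above.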
