lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pratik Patel <pra...@semandex.net>
Subject Re: How to figure out whether stopwords are being indexed or not
Date Wed, 22 Feb 2017 17:22:33 GMT
That explains why I was getting back the results. Thanks! I was doing that
query only to test whether stopwords are being indexed or not but
apparently the query I had would not serve the purpose.  I should rather
have a document field with just the stop word and search against it without
using wildcard to test whether the stopword was indexed or not. Thanks
again.

Regards,
Pratik

On Wed, Feb 22, 2017 at 12:10 PM, Alexandre Rafalovitch <arafalov@gmail.com>
wrote:

> StopFilterFactory (and WordDelimiterFilterFactory and maybe others)
> are NOT multiterm aware.
>
> Using wildcards triggers the edge-case third type of analyzer chain
> that is automatically constructed unless you specify it explicitly.
>
> You can see the full list of analyzers and whether they are multiterm
> aware at http://www.solr-start.com/info/analyzers/ (I mark them with
> "(multi)").
>
> Solution in your case is probably to go away from these
> performance-killing double-side wildcards and to switch to the NGrams
> instead. And you may want to look at ApostropheFilterFactory while you
> are at it (instead of regexp you have there).
>
> Regards,
>    Alex.
>
> ----
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 22 February 2017 at 12:02, Pratik Patel <pratik@semandex.net> wrote:
> > Asterisks were not for formatting, I was trying to use a wildcard
> operator.
> > Here is another example query and "parsed_query toString" entry for that.
> >
> > Query :
> > http://localhost:8081/solr/collection1/select?debugQuery=
> on&indent=on&q=Description_note:*their*&wt=json
> >
> > "parsedquery_toString":"Description_note:*their*"
> >
> > I have word "their" in my stopwords list so I am expecting zero results
> but
> > this query returns 20 documents with word "their"
> >
> > Here is more of the debug object of response.
> >
> >
> > "debug":{
> >     "rawquerystring":"Description_note:*their*",
> >     "querystring":"Description_note:*their*",
> >     "parsedquery":"Description_note:*their*",
> >     "parsedquery_toString":"Description_note:*their*",
> >     "explain":{
> >       "54227b012a1c4e574f88505556987be57ef1af28d01b6d94":"\n1.0 =
> > Description_note:*their*, product of:\n  1.0 = boost\n  1.0 =
> > queryNorm\n", ....
> >       },
> >     "QParser":"LuceneQParser",
> >     "timing":{ ... }
> >
> > }
> >
> > Thanks,
> >
> > Pratik
> >
> >
> >
> >
> >
> >
> > On Wed, Feb 22, 2017 at 11:25 AM, Erick Erickson <
> erickerickson@gmail.com>
> > wrote:
> >
> >> That's not what I'm looking for. Way down near the end there should be
> >> an entry like
> >> "parsed_query toString"
> >>
> >> This line is pretty suspicious: 82, "params":{ "q":"Description_note:*
> >> and *"
> >>
> >> Are you really searching for asterisks (I'd originally interpreted
> >> that as bolding
> >> which sometimes happens). Please don't do formatting with asterisks in
> >> e-mails as it's very confusing.
> >>
> >> Best,
> >> Erick
> >>
> >>
> >> On Wed, Feb 22, 2017 at 8:12 AM, Pratik Patel <pratik@semandex.net>
> wrote:
> >> > Hi Eric,
> >> >
> >> > Thanks for the reply! Following is the relevant part of response
> header
> >> > with debugQuery on.
> >> >
> >> > {
> >> > "responseHeader":{ "status":0, "QTime":282, "params":{
> >> "q":"Description_note:*
> >> > and *", "indent":"on", "wt":"json", "debugQuery":"on",
> >> "_":"1487773835305"}},
> >> > "response":{"numFound":81771,"start":0,"docs":[ { "id":"<id>", .
> >> > .
> >> > .
> >> > },..
> >> > ]
> >> > }
> >> > }
> >> >
> >> >
> >> > On Tue, Feb 21, 2017 at 8:22 PM, Erick Erickson <
> erickerickson@gmail.com
> >> >
> >> > wrote:
> >> >
> >> >> Attach &debug=query to your query and look at the parsed query
that's
> >> >> returned.
> >> >> That'll tell you what was searched at least.
> >> >>
> >> >> You can also use the TermsComponent to examine terms in a field
> >> directly.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Tue, Feb 21, 2017 at 2:52 PM, Pratik Patel <pratik@semandex.net>
> >> wrote:
> >> >> > I have a field type in schema which has been applied stopwords
> list.
> >> >> > I have verified that path of stopwords file is correct and it
is
> being
> >> >> > loaded fine in solr admin UI. When I analyse these fields using
> >> >> "Analysis" tab
> >> >> > of the solr admin UI, I can see that stopwords are being filtered
> out.
> >> >> > However, when I query with some of these stopwords, I do get the
> >> results
> >> >> > back which makes me think that probably stopwords are being
> indexed.
> >> >> >
> >> >> > For example, when I run following query, I do get back results.
I
> have
> >> >> word
> >> >> > "and" in the stopwords list so I expect no results for this query.
> >> >> >
> >> >> > http://localhost:8081/solr/collection1/select?fq=
> >> >> Description_note:*%20and%20*&indent=on&q=*:*&rows=100&
> start=0&wt=json
> >> >> >
> >> >> > Does this mean that the "and" word is being indexed and stopwords
> are
> >> not
> >> >> > being used?
> >> >> >
> >> >> > Following is the field type of field Description_note :
> >> >> >
> >> >> >
> >> >> > <fieldType name="text_general" class="solr.TextField"
> >> >> > positionIncrementGap="100" omitNorms="true">
> >> >> >       <analyzer type="index">
> >> >> >       <charFilter class="solr.HTMLStripCharFilterFactory" />
> >> >> > <tokenizer class="solr.StandardTokenizerFactory"/>
> >> >> > <filter class="solr.LowerCaseFilterFactory"/>
> >> >> > <charFilter class="solr.PatternReplaceCharFilterFactory"
> >> >> > pattern="((?m)[a-z]+)'s" replacement="$1s" />
> >> >> > <filter class="solr.WordDelimiterFilterFactory"
> >> >> protected="protwords.txt"
> >> >> > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >> >> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
> >> >> >         <filter class="solr.KStemFilterFactory" />
> >> >> >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> >> > words="stopwords.txt" />
> >> >> >       </analyzer>
> >> >> >       <analyzer type="query">
> >> >> >       <charFilter class="solr.HTMLStripCharFilterFactory" />
> >> >> >         <tokenizer class="solr.StandardTokenizerFactory"/>
> >> >> > <filter class="solr.LowerCaseFilterFactory"/>
> >> >> > <charFilter class="solr.PatternReplaceCharFilterFactory"
> >> >> > pattern="((?m)[a-z]+)'s" replacement="$1s" />
> >> >> > <filter class="solr.WordDelimiterFilterFactory"
> >> >> protected="protwords.txt"
> >> >> > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >> >> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
> >> >> >         <filter class="solr.KStemFilterFactory" />
> >> >> >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> >> > words="stopwords.txt" />
> >> >> >       </analyzer>
> >> >> >     </fieldType>
> >> >>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message