lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: How to figure out whether stopwords are being indexed or not
Date Wed, 22 Feb 2017 17:10:34 GMT
StopFilterFactory (and WordDelimiterFilterFactory and maybe others)
are NOT multiterm aware.

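Using wildcards triggers the edge-case third type of analyzer chain
(the "multiterm" chain), which is automatically constructed unless you
specify it explicitly. Since StopFilterFactory is not multiterm aware,
it is left out of that automatically built chain, so the stopword is
never removed from your wildcard query.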
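
To make that third chain visible, here is a minimal sketch of declaring
it explicitly inside a field type. The choice of KeywordTokenizerFactory
plus LowerCaseFilterFactory is an assumption about what you would want
for wildcard terms, not something taken from your schema, and the index
and query analyzers are elided on purpose:

```xml
<fieldType name="text_general" class="solr.TextField"
    positionIncrementGap="100" omitNorms="true">
  <!-- existing <analyzer type="index"> and <analyzer type="query">
       definitions stay here unchanged -->

  <!-- Explicit chain applied to wildcard, prefix and similar
       multi-term queries. Illustrative only: the wildcard term is
       kept as a single token and lowercased. -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Declaring it does not by itself make stopwords apply to wildcard
queries; it only makes explicit what is run against terms like *their*.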

You can see the full list of analyzers and whether they are multiterm
aware at http://www.solr-start.com/info/analyzers/ (I mark them with
"(multi)").

The solution in your case is probably to move away from these
performance-killing double-sided wildcards and switch to NGrams
instead. You may also want to look at ApostropheFilterFactory while
you are at it (instead of the regexp you have there).
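
As a rough sketch of the NGram approach, assuming a new field type
(the name text_ngram and the gram sizes 3 and 15 are made up for
illustration and would need tuning for your data):

```xml
<fieldType name="text_ngram" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drop stopwords before generating grams so they are never
         indexed, either whole or in pieces -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt"/>
    <!-- Index every substring of each remaining token whose length
         is between minGramSize and maxGramSize -->
    <filter class="solr.NGramFilterFactory" minGramSize="3"
        maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- No NGram filter at query time: the plain query term is
         matched against the indexed grams -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

With something like this, a substring search becomes a plain term query
with no wildcards, so the query analyzer (including the stop filter)
runs normally. The trade-off is a larger index.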

Regards,
   Alex.

----
http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 22 February 2017 at 12:02, Pratik Patel <pratik@semandex.net> wrote:
> Asterisks were not for formatting, I was trying to use a wildcard operator.
> Here is another example query and "parsed_query toString" entry for that.
>
> Query :
> http://localhost:8081/solr/collection1/select?debugQuery=on&indent=on&q=Description_note:*their*&wt=json
>
> "parsedquery_toString":"Description_note:*their*"
>
> I have the word "their" in my stopwords list, so I am expecting zero
> results, but this query returns 20 documents with the word "their".
>
> Here is more of the debug object of response.
>
>
> "debug":{
>     "rawquerystring":"Description_note:*their*",
>     "querystring":"Description_note:*their*",
>     "parsedquery":"Description_note:*their*",
>     "parsedquery_toString":"Description_note:*their*",
>     "explain":{
>       "54227b012a1c4e574f88505556987be57ef1af28d01b6d94":"\n1.0 =
> Description_note:*their*, product of:\n  1.0 = boost\n  1.0 =
> queryNorm\n", ....
>       },
>     "QParser":"LuceneQParser",
>     "timing":{ ... }
>
> }
>
> Thanks,
>
> Pratik
>
>
>
>
>
>
> On Wed, Feb 22, 2017 at 11:25 AM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
>> That's not what I'm looking for. Way down near the end there should be
>> an entry like
>> "parsed_query toString"
>>
>> This line is pretty suspicious: 82, "params":{ "q":"Description_note:*
>> and *"
>>
>> Are you really searching for asterisks? (I'd originally interpreted
>> that as bolding, which sometimes happens.) Please don't do formatting
>> with asterisks in e-mails, as it's very confusing.
>>
>> Best,
>> Erick
>>
>>
>> On Wed, Feb 22, 2017 at 8:12 AM, Pratik Patel <pratik@semandex.net> wrote:
>> > Hi Erick,
>> >
>> > Thanks for the reply! Following is the relevant part of response header
>> > with debugQuery on.
>> >
>> > {
>> > "responseHeader":{ "status":0, "QTime":282, "params":{
>> > "q":"Description_note:* and *", "indent":"on", "wt":"json",
>> > "debugQuery":"on", "_":"1487773835305"}},
>> > "response":{"numFound":81771,"start":0,"docs":[ { "id":"<id>", .
>> > .
>> > .
>> > },..
>> > ]
>> > }
>> > }
>> >
>> >
>> > On Tue, Feb 21, 2017 at 8:22 PM, Erick Erickson <erickerickson@gmail.com>
>> > wrote:
>> >
>> >> Attach &debug=query to your query and look at the parsed query that's
>> >> returned. That'll tell you what was searched at least.
>> >>
>> >> You can also use the TermsComponent to examine terms in a field
>> >> directly.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Tue, Feb 21, 2017 at 2:52 PM, Pratik Patel <pratik@semandex.net>
>> >> wrote:
>> >> > I have a field type in my schema to which a stopwords list has been
>> >> > applied. I have verified that the path of the stopwords file is
>> >> > correct and that it is being loaded fine in the Solr admin UI. When I
>> >> > analyse these fields using the "Analysis" tab of the Solr admin UI, I
>> >> > can see that stopwords are being filtered out. However, when I query
>> >> > with some of these stopwords, I do get results back, which makes me
>> >> > think that the stopwords are probably being indexed.
>> >> >
>> >> > For example, when I run the following query, I do get back results. I
>> >> > have the word "and" in the stopwords list, so I expect no results for
>> >> > this query.
>> >> >
>> >> > http://localhost:8081/solr/collection1/select?fq=Description_note:*%20and%20*&indent=on&q=*:*&rows=100&start=0&wt=json
>> >> >
>> >> > Does this mean that the word "and" is being indexed and stopwords are
>> >> > not being used?
>> >> >
>> >> > Following is the field type of field Description_note:
>> >> >
>> >> >
>> >> > <fieldType name="text_general" class="solr.TextField"
>> >> >     positionIncrementGap="100" omitNorms="true">
>> >> >   <analyzer type="index">
>> >> >     <charFilter class="solr.HTMLStripCharFilterFactory" />
>> >> >     <tokenizer class="solr.StandardTokenizerFactory"/>
>> >> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >> >     <charFilter class="solr.PatternReplaceCharFilterFactory"
>> >> >         pattern="((?m)[a-z]+)'s" replacement="$1s" />
>> >> >     <filter class="solr.WordDelimiterFilterFactory"
>> >> >         protected="protwords.txt" generateWordParts="1"
>> >> >         generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> >> >         catenateAll="0" splitOnCaseChange="0"/>
>> >> >     <filter class="solr.KStemFilterFactory" />
>> >> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >> >         words="stopwords.txt" />
>> >> >   </analyzer>
>> >> >   <analyzer type="query">
>> >> >     <charFilter class="solr.HTMLStripCharFilterFactory" />
>> >> >     <tokenizer class="solr.StandardTokenizerFactory"/>
>> >> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >> >     <charFilter class="solr.PatternReplaceCharFilterFactory"
>> >> >         pattern="((?m)[a-z]+)'s" replacement="$1s" />
>> >> >     <filter class="solr.WordDelimiterFilterFactory"
>> >> >         protected="protwords.txt" generateWordParts="1"
>> >> >         generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> >> >         catenateAll="0" splitOnCaseChange="0"/>
>> >> >     <filter class="solr.KStemFilterFactory" />
>> >> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >> >         words="stopwords.txt" />
>> >> >   </analyzer>
>> >> > </fieldType>
>> >>
>>
