lucene-solr-user mailing list archives

From Yonik Seeley <yo...@lucidimagination.com>
Subject Re: Proper Escaping of Ampersands
Date Mon, 23 Aug 2010 19:52:49 GMT
I'd recommend going back to the "textgen" field type as defined in the
example schema.
Your move of the StopFilter is what is causing the problem.
At index time, the "s" gets removed (because the StopFilter is now
after the WDF).
But a query of "at&s" is transformed into "at s" (the s isn't removed
because StopFilter is before WDF for the query analyzer).  Since "s"
isn't in the index, no docs are found.
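
For illustration, here is a minimal sketch of an index analyzer with the
StopFilter moved back in front of the WDF, so that both analyzers apply
the two filters in the same order. It assumes the tokenizer, filters and
attribute values from the fieldType quoted below are otherwise kept as
posted. It is not the verbatim example schema, so compare it against the
schema.xml that ships with your Solr version.

```xml
<!-- Sketch only: the posted index analyzer with StopFilter placed before
     WordDelimiterFilter, mirroring the order used by the query analyzer.
     Keeping the tokenizer and all attribute values unchanged is an
     assumption made for illustration. -->
<analyzer type="index">
  <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
  <!-- StopFilter now sees the unsplit token "at&s", which is not a stopword -->
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <!-- WDF splits afterwards, so the "s" part is no longer dropped -->
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

With this order the "s" produced by the WDF should stay in the index, so
the phrase generated for "at&s" has something to match. Documents need
to be reindexed after changing the index analyzer.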

Also, I notice you're using preserveOriginal=1 - make sure you really
need that... it's normally only useful if you are doing wildcard
searches (for example at&*).

-Yonik
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8


On Mon, Aug 23, 2010 at 5:43 AM, Nikolas Tautenhahn
<nik_solr@livinglogic.de> wrote:
> Hi Yonik,
>
> I got it working, but I think the Stopword Filter is not behaving as
> expected - (The document could be found when I disabled the stopword
> filter, details later in this mail...)
>
> On 20.08.2010 16:57, Yonik Seeley wrote:
>> On Thu, Aug 19, 2010 at 11:33 AM, Nikolas Tautenhahn
>> <nik_solr@livinglogic.de> wrote:
>>> But when I search for q=at%26s (=at&s), I get nothing.
>>
>> That's the correct encoding if you're typing it directly into a
>> browser address box.
>> http://localhost:8983/solr/select?defType=dismax&qf=text&q=at%26s&debugQuery=true
>>
>> But you should be able to verify that solr is getting the correct
>> query string by checking out "params" in the response (in the example
>> server, by default they are echoed back).  And adding debugQuery=true
>> to the request should show you exactly what query is being generated.
>>
>> But the real issue likely lies with your fieldType definition.  Can
>> you show that?
>
> As I (normally) query multiple fields, I changed my request URL to
> http://127.0.0.1:8983/solr/select?q=at%26s&fl=titel&qt=dismax&qf=titel&debugQuery=true
> in order to narrow it down and got this response (cut down to what I
> think is the relevant part):
>
>> <str name="rawquerystring">at&s</str>
>> <str name="querystring">at&s</str>
>> <str name="parsedquery">+DisjunctionMaxQuery((titel:"(at&s at) s")~0.1) ()</str>
>> <str name="parsedquery_toString">+(titel:"(at&s at) s")~0.1 ()</str>
>> <lst name="explain"/>
>> <str name="QParser">DisMaxQParser</str>
>
> on my local debugging instance, using standard dismax config (from the
> examples directory at solr).
>
> The "titel"-Field is configured like this:
>
>>   <field name="titel" type="textgen" indexed="true" stored="true"/>
>
> and "textgen" is configured like this
>
>>     <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
>>       <analyzer type="index">
>>         <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
>>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
>>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
>>       </analyzer>
>>       <analyzer type="query">
>>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
>>         <filter class="solr.LowerCaseFilterFactory"/>
>>       </analyzer>
>>     </fieldType>
>
> The document is indexed correctly, a search for "at s" found it and all
> fields looked great ("at&s" and not, for example, "at&amp;s").
>
> As my stopword list does not contain "at" or "&" or "&amp;", I don't
> quite understand why the document is found once I disable the
> stopword list. My stopword list can be found here:
>
> http://pastebin.com/RfLuBHqd
>
> Do you happen to see bad things for a string like "at&s" here?
>
> The analysis page in the admin panel shows these steps for the Index
> Analyzer:
>
> (HTMLStripStandardTokenizer) at&s => at&s
> (SynonymFilter) at&s => at&s
> (WordDelimiterFilter) at&s => term position 1: at&s, at; term pos 2: s, ats
> (LowerCaseFilter) 1: at&s, at; 2: s, ats => 1: at&s, at; 2: s, ats
> (StopFilter) 1: at&s, at; 2: s, ats => 1: at&s, at; 2: ats
>
> So, according to this, it should be found even with my stopwords enabled...
>
>
> best regards and thanks for your response,
> Nikolas Tautenhahn
>
