lucene-solr-user mailing list archives

From Nikolas Tautenhahn
Subject Re: Proper Escaping of Ampersands
Date Mon, 23 Aug 2010 09:43:37 GMT
Hi Yonik,

I got it working, but I think the stopword filter is not behaving as
expected. (The document could be found once I disabled the stopword
filter; details later in this mail.)

On 20.08.2010 16:57, Yonik Seeley wrote:
> On Thu, Aug 19, 2010 at 11:33 AM, Nikolas Tautenhahn wrote:
>> But when I search for q=at%26s (=at&s), I get nothing.
> That's the correct encoding if you're typing it directly into a
> browser address box.
> http://localhost:8983/solr/select?defType=dismax&qf=text&q=at%26s&debugQuery=true
> But you should be able to verify that solr is getting the correct
> query string by checking out "params" in the response (in the example
> server, by default they are echoed back).  And adding debugQuery=true
> to the request should show you exactly what query is being generated.
> But the real issue likely lies with your fieldType definition.  Can
> you show that?
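
For reference, the encoding described in the quoted reply can be reproduced with a minimal Python sketch. It only uses the standard library; the parameter names and the localhost URL are taken from the example URL quoted above, and nothing here talks to a running Solr instance.

```python
from urllib.parse import quote, urlencode, parse_qs

raw_query = "at&s"

# Percent-encode the ampersand so it is not read as a parameter separator.
encoded_q = quote(raw_query, safe="")

# urlencode applies the same escaping to every parameter value.
params = {"defType": "dismax", "qf": "text", "q": raw_query, "debugQuery": "true"}
query_string = urlencode(params)
url = "http://localhost:8983/solr/select?" + query_string

# Round trip: decoding the query string gives back the original text.
decoded_q = parse_qs(query_string)["q"][0]

print(encoded_q)
print(url)
print(decoded_q)
```

Building the URL this way avoids hand-escaping, which is where a literal "&" in the query value would otherwise split the q parameter.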

As I normally query multiple fields, I changed my request URL to
*&qt=dismax&qf=titel&debugQuery=true
in order to narrow it down and got this response (trimmed to what I
think is the relevant part):

> <str name="rawquerystring">at&s</str>
> <str name="querystring">at&s</str>
> <str name="parsedquery">+DisjunctionMaxQuery((titel:"(at&s at) s")~0.1) ()</str>
> <str name="parsedquery_toString">+(titel:"(at&s at) s")~0.1 ()</str>
> <lst name="explain"/>
> <str name="QParser">DisMaxQParser</str>

This is on my local debugging instance, using the standard dismax config
(from the example directory shipped with Solr).

The "titel" field is configured like this:

>   <field name="titel" type="textgen" indexed="true" stored="true"/>

and "textgen" is configured like this:

>     <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
> 	<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>

The document is indexed correctly; a search for "at s" found it and all
fields looked fine ("at&s" and not, for example, "at&amp;s").
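
As a small illustration of why the stored value reads "at&s" rather than "at&amp;s": in an XML document the ampersand has to be escaped on the way in, and an XML parser turns it back into a plain "&". The sketch below uses only the Python standard library; the `<doc>`/`<field>` shape is illustrative and is not copied from my actual indexing code.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

title = "at&s"

# Escape the ampersand for use inside XML character data.
escaped = escape(title)

# Illustrative document fragment carrying the escaped value.
doc_xml = '<doc><field name="titel">' + escaped + "</field></doc>"

# Parsing the XML restores the original, unescaped text.
parsed_title = ET.fromstring(doc_xml).find("field").text

print(escaped)
print(parsed_title)
```

If the value were escaped twice before sending, the parsed text would come back as "at&amp;s", which is the symptom I checked for.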

As my stopword list does not contain "at" or "&" or "&amp;", I don't
quite understand why my result is only found when I disable the
stopword list. My stopword list can be found here

Do you see anything here that would cause problems for a string like "at&s"?

The analysis page in the admin panel shows these steps for the index analyzer:

(HTMLStripStandardTokenizer) at&s => at&s
(SynonymFilter) at&s => at&s
(WordDelimiterFilter) at&s => term position 1: at&s, at; term pos 2: s, ats
(LowerCaseFilter) 1: at&s, at; 2: s, ats => 1: at&s, at; 2: s, ats
(StopFilter) 1: at&s, at; 2: s, ats => 1: at&s, at; 2: ats

So, according to this, it should be found even with my stopwords enabled...
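
One way to cross-check that expectation is a toy Python sketch. It assumes the parsed query titel:"(at&s at) s" behaves like an exact phrase (slop 0) that needs "at&s" or "at" at one position and "s" at the very next one. The dictionaries just copy the token positions from the analysis output above, and the helper `phrase_matches` is hypothetical, not Lucene's implementation.

```python
# Toy model of positional phrase matching; not Lucene's implementation.
# Token positions are copied from the analysis-page output above.
index_with_stop = {1: {"at&s", "at"}, 2: {"ats"}}          # after StopFilter
index_without_stop = {1: {"at&s", "at"}, 2: {"s", "ats"}}  # StopFilter disabled

# parsedquery: titel:"(at&s at) s"
# Slot 0 accepts either alternative, slot 1 must be "s".
phrase = [{"at&s", "at"}, {"s"}]


def phrase_matches(index, phrase):
    """Return True if, for some start position, every phrase slot matches
    a term at the corresponding consecutive position (slop 0)."""
    for start in index:
        if all(index.get(start + i, set()) & slot
               for i, slot in enumerate(phrase)):
            return True
    return False


print(phrase_matches(index_with_stop, phrase))
print(phrase_matches(index_without_stop, phrase))
```

Under these assumptions the phrase does not match the tokens left after the StopFilter, because "s" is dropped at position 2 at index time while the query still asks for it. That would at least be consistent with the document only being found when the stop filter is disabled, but it is a sketch of the positions, not a confirmed diagnosis.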

Best regards, and thanks for your response,
Nikolas Tautenhahn
