lucene-solr-user mailing list archives

From Andreas Hubold <andreas.hub...@coremedia.com>
Subject Re: dismax query does not match with additional field in qf
Date Wed, 08 Oct 2014 08:39:17 GMT
The query is not from a real use-case. We used it to test edge cases. I 
just asked to better understand the parser as its behavior did not match 
my expectations.

Anyway, one use-case I can think of is a free search field for end-users 
where they can search in both ID and text fields including phrases - 
without specifying whether their query is an ID or full-text. Users 
typically just expect the "right thing" to happen. So application 
developers have to be aware of such effects. Maybe the newer simple 
query parser would be a better fit for us.

There were also some good comments in SOLR-6602, especially a link to 
SOLR-3085 which describes a more realistic case with stopword removal.

Thanks everybody!

Regards,
Andreas
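
For the free search field described above, one option (suggested in the quoted reply below) is to send the user's input as a single phrase. The following is only a rough sketch of that idea. The helper names `as_phrase` and `build_url` are hypothetical, and the assumption that backslash-escaping embedded double quotes is enough for the dismax parser has not been verified against Solr 4.10.

```python
from urllib.parse import urlencode


def as_phrase(user_input):
    """Wrap raw user input in double quotes so it is parsed as one phrase.

    Assumption: backslashes and embedded double quotes are the only
    characters that need escaping inside the phrase.
    """
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_url(base_url, user_input, qf):
    """Build a dismax request URL; urlencode handles the percent-encoding."""
    params = {
        "defType": "dismax",
        "q": as_phrase(user_input),
        "qf": qf,
        "mm": "100%",
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Handler URL and field names are taken from the original question.
    print(build_url(
        "http://localhost:44080/solr/studio/editor",
        "abc_<iframe src='loadLocale.js'",
        "name_tokenized^2 feederstate",
    ))
```

With a phrase, the string field is asked for the whole input as one term, while the tokenized field gets a phrase of contiguous tokens, so a stray apostrophe no longer forms a clause of its own.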

Jack Krupansky wrote on 10/07/2014 06:16 PM:
> Your query term seems particularly inappropriate for dismax - think 
> simple keyword queries.
>
> Also, don't confuse dismax and edismax - maybe you want the latter. 
> The former is for... simple keyword queries.
>
> I'm still not sure what your actual use case really is. In particular, 
> are you trying to do a full, exact match on the string field, or a 
> substring match? You can do the latter with wildcards or regex, but 
> normally the former (exact match) is used.
>
> Maybe simply enclosing the complex term in quotes to make it a phrase 
> query is what you need - that would do an exact match on the string 
> field, but a tokenized phrase match on the text field, and support 
> partial matches on the text field as a phrase of contiguous terms.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Andreas Hubold
> Sent: Tuesday, October 7, 2014 12:08 PM
> To: solr-user@lucene.apache.org
> Subject: Re: dismax query does not match with additional field in qf
>
> Okay, sounds reasonable. However I didn't expect this when reading the
> documentation of the dismax query parser.
>
> Especially the need to escape special characters (and which ones) was
> not clear to me as the dismax query parser "is designed to process
> simple phrases (without complex syntax) entered by users" and "special
> characters (except AND and OR) are escaped" by the parser - as written
> on 
> https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
>
> Do you know if the new Simple Query Parser has the same behaviour when
> searching across multiple fields? Or could it be used instead to search
> across "text_general" and "string" fields of arbitrary content without
> additional query preprocessing to get results for matches in any of
> these fields (as in field1:STUFF OR field2:STUFF).
>
> Thank you,
> Andreas
>
> Jack Krupansky wrote on 10/07/2014 05:24 PM:
>> I think what is happening is that your last term, the naked 
>> apostrophe, is analyzing to zero terms and simply being ignored, but 
>> when you add the extra field, a string field, you now have another 
>> term in the query, and you have mm set to 100%, so that "new" term 
>> must match. It probably fails because you have no naked apostrophe 
>> term in that field in the index.
>>
>> Probably none of your string field terms were matching before, but 
>> that wasn't apparent since the tokenized text matched. But with this 
>> naked apostrophe term, there is no way to tell Lucene to match "no" 
>> term, so it required the string term to match, which won't happen 
>> since only the full string is indexed.
>>
>> Generally, you need to escape all special characters in a query. Then 
>> hopefully your string field will match.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Andreas Hubold
>> Sent: Tuesday, September 30, 2014 11:14 AM
>> To: solr-user@lucene.apache.org
>> Subject: dismax query does not match with additional field in qf
>>
>> Hi,
>>
>> I ran into a problem with the Solr dismax query parser. We're using Solr
>> 4.10.0 and the field types mentioned below are taken from the example
>> schema.xml.
>>
>> In a test we have a document with rather strange content in a field
>> named "name_tokenized" of type "text_general":
>>
>> abc_<iframe src='loadLocale.js' 
>> onload='javascript:document.XSSed="name"' width=0 height=0>
>>
>> (It's a test for XSS bug detection, but that doesn't matter here.)
>>
>> I can find the document when I use the following dismax query with qf
>> set to field "name_tokenized" only:
>>
>> http://localhost:44080/solr/studio/editor?deftype=dismax&q=abc_%3Ciframe+src%3D%27loadLocale.js%27+onload%3D%27javascript%3Adocument.XSSed%3D%22name%22%27&debug=true&echoParams=all&qf=name_tokenized^2
>>
>> If I submit exactly the same query but add another field "feederstate"
>> to the qf parameter, I don't get any results anymore. The field is of
>> type "string".
>>
>> http://localhost:44080/solr/studio/editor?deftype=dismax&q=abc_%3Ciframe+src%3D%27loadLocale.js%27+onload%3D%27javascript%3Adocument.XSSed%3D%22name%22%27&debug=true&echoParams=all&qf=name_tokenized^2%20feederstate
>>
>> The decoded value of q is: abc_<iframe src='loadLocale.js'
>> onload='javascript:document.XSSed="name"' and it seems the trailing
>> single-quote causes problems here. (In fact, I can find the document
>> when I remove the last char)
>> The parsed query for the latter case is
>>
>> (
>>   +((
>>     DisjunctionMaxQuery((feederstate:abc_<iframe | ((name_tokenized:abc_ name_tokenized:iframe)^2.0))~0.1)
>>     DisjunctionMaxQuery((feederstate:src='loadLocale.js' | ((name_tokenized:src name_tokenized:loadlocale.js)^2.0))~0.1)
>>     DisjunctionMaxQuery((feederstate:onload='javascript:document.XSSed= | ((name_tokenized:onload name_tokenized:javascript:document.xssed)^2.0))~0.1)
>>     DisjunctionMaxQuery((feederstate:name | name_tokenized:name^2.0)~0.1)
>>     DisjunctionMaxQuery((feederstate:')~0.1)
>>   )~5)
>>
>>   DisjunctionMaxQuery((textbody:"abc_ iframe src loadlocale.js onload javascript:document.xssed name" | name_tokenized:"abc_ iframe src loadlocale.js onload javascript:document.xssed name"^2.0)~0.1)
>> )/no_coord
>>
>>
>> I've configured the handler with <str name="mm">100%</str> so that all
>> of the 5 dismax queries at the top must match. But this one does not 
>> match:
>>
>> DisjunctionMaxQuery((feederstate:')~0.1)
>>
>>
>> I'd expect that an additional field in the qf parameter would not lead
>> to fewer matches.
>> Okay, the above example is a rather crude test but I'd like to
>> understand it. Is this a bug in Solr?
>>
>> I've also found https://issues.apache.org/jira/browse/SOLR-3047 which
>> sounds somewhat similar.
>>
>> Regards,
>> Andreas
>> .
>>
>
> .
>
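
The effect discussed in this thread (adding a field to qf produces an extra required clause) can be illustrated with a toy model. This is only a sketch and not Solr's actual implementation: the two analyzer functions are crude stand-ins for the "text_general" and "string" field types, the query is split on whitespace only, and `build_clauses` is a hypothetical helper.

```python
import re


def analyze_tokenized(chunk):
    # Crude stand-in for a tokenizing analyzer: lowercased word tokens only.
    # A bare apostrophe therefore analyzes to zero terms.
    return re.findall(r"\w+", chunk.lower())


def analyze_string(chunk):
    # A string field is not analyzed: the chunk is kept verbatim as one term.
    return [chunk]


def build_clauses(query, analyzers):
    """Return one clause per query chunk that yields a term in any field.

    Each clause maps field name -> terms, mimicking one DisjunctionMaxQuery.
    Chunks that analyze to zero terms in every field produce no clause.
    """
    clauses = []
    for chunk in query.split():
        per_field = {}
        for field, analyze in analyzers.items():
            terms = analyze(chunk)
            if terms:
                per_field[field] = terms
        if per_field:
            clauses.append(per_field)
    return clauses


if __name__ == "__main__":
    query = "abc name '"
    text_only = build_clauses(query, {"name_tokenized": analyze_tokenized})
    both = build_clauses(query, {
        "name_tokenized": analyze_tokenized,
        "feederstate": analyze_string,
    })
    # With mm=100% every clause is required.
    print(len(text_only), "required clauses with the tokenized field only")
    print(len(both), "required clauses with the string field added")
    print("extra clause:", both[-1])
```

In this model the apostrophe silently disappears while only the tokenized field is queried, but it becomes a required clause on the string field alone once that field is added, which matches the `DisjunctionMaxQuery((feederstate:')~0.1)` clause in the parsed query quoted above.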


