lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: dismax query does not match with additional field in qf
Date Tue, 07 Oct 2014 16:16:20 GMT
Your query term seems particularly inappropriate for dismax - think simple 
keyword queries.

Also, don't confuse dismax and edismax - maybe you want the latter. The 
former is for... simple keyword queries.

I'm still not sure what your actual use case really is. In particular, are 
you trying to do a full, exact match on the string field, or a substring 
match? You can do the latter with wildcards or regex, but normally the 
former (exact match) is used.

Maybe simply enclosing the complex term in quotes to make it a phrase query 
is what you need - that would do an exact match on the string field, but a 
tokenized phrase match on the text field, and support partial matches on the 
text field as a phrase of contiguous terms.
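
The phrase-query suggestion above could be sketched roughly as follows in Python. This is only an illustration, not something taken from the thread: the helper names `as_phrase` and `build_params` are hypothetical, and the field names are the ones mentioned in the quoted message below. It assumes the parser is selected with Solr's `defType` request parameter.

```python
from urllib.parse import urlencode


def as_phrase(text):
    """Wrap raw user input in double quotes so it is sent as one phrase.

    Backslashes and embedded double quotes are backslash-escaped first,
    so they cannot terminate the phrase early.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_params(user_input, fields):
    """Build a URL-encoded query string for a dismax request (sketch).

    `fields` is the list of field names to put into qf.
    """
    return urlencode({
        "defType": "dismax",
        "q": as_phrase(user_input),
        "qf": " ".join(fields),
    })


if __name__ == "__main__":
    raw = "abc_<iframe src='loadLocale.js'"
    print(build_params(raw, ["name_tokenized", "feederstate"]))
```

Whether the result actually matches still depends on how each field is analyzed and on what was indexed, so treat this as a starting point for experiments with debug=true.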

-- Jack Krupansky

-----Original Message----- 
From: Andreas Hubold
Sent: Tuesday, October 7, 2014 12:08 PM
To: solr-user@lucene.apache.org
Subject: Re: dismax query does not match with additional field in qf

Okay, sounds reasonable. However, I didn't expect this when reading the
documentation of the dismax query parser.

Especially the need to escape special characters (and which ones) was
not clear to me as the dismax query parser "is designed to process
simple phrases (without complex syntax) entered by users" and "special
characters (except AND and OR) are escaped" by the parser - as written
on https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser

Do you know if the new Simple Query Parser has the same behaviour when
searching across multiple fields? Or could it be used instead to search
across "text_general" and "string" fields of arbitrary content without
additional query preprocessing to get results for matches in any of
these fields (as in field1:STUFF OR field2:STUFF)?
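
For illustration, the "field1:STUFF OR field2:STUFF" form mentioned above could be built by hand along these lines. This is a rough Python sketch under assumptions, not a tested recipe: `escape_query_chars` and `any_field_query` are hypothetical helpers, the list of special characters is modeled on the Lucene query syntax (similar in spirit to SolrJ's ClientUtils.escapeQueryChars), and the resulting string is meant for the standard lucene parser rather than dismax.

```python
# Characters treated as query syntax by the Lucene query parser (assumed list).
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')


def escape_query_chars(text):
    """Backslash-escape query syntax characters and whitespace."""
    return "".join(
        "\\" + ch if ch in SPECIAL_CHARS or ch.isspace() else ch
        for ch in text
    )


def any_field_query(fields, value):
    """Return 'f1:VALUE OR f2:VALUE ...' with VALUE escaped once."""
    escaped = escape_query_chars(value)
    return " OR ".join(f"{field}:{escaped}" for field in fields)


if __name__ == "__main__":
    print(any_field_query(["name_tokenized", "feederstate"], "abc_<iframe src"))
```

Because whitespace is escaped too, a string field sees the whole value as a single term. How a tokenized text field treats the same clause depends on its analyzer and field type settings.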

Thank you,
Andreas

Jack Krupansky wrote on 10/07/2014 05:24 PM:
> I think what is happening is that your last term, the naked apostrophe, is 
> analyzing to zero terms and simply being ignored, but when you add the 
> extra field, a string field, you now have another term in the query, and 
> you have mm set to 100%, so that "new" term must match. It probably fails 
> because you have no naked apostrophe term in that field in the index.
>
> Probably none of your string field terms were matching before, but that 
> wasn't apparent since the tokenized text matched. But with this naked 
> apostrophe term, there is no way to tell Lucene to match "no" term, so it 
> required the string term to match, which won't happen since only the full 
> string is indexed.
>
> Generally, you need to escape all special characters in a query. Then 
> hopefully your string field will match.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Andreas Hubold
> Sent: Tuesday, September 30, 2014 11:14 AM
> To: solr-user@lucene.apache.org
> Subject: dismax query does not match with additional field in qf
>
> Hi,
>
> I ran into a problem with the Solr dismax query parser. We're using Solr
> 4.10.0 and the field types mentioned below are taken from the example
> schema.xml.
>
> In a test we have a document with rather strange content in a field
> named "name_tokenized" of type "text_general":
>
> abc_<iframe src='loadLocale.js' onload='javascript:document.XSSed="name"' 
> width=0 height=0>
>
> (It's a test for XSS bug detection, but that doesn't matter here.)
>
> I can find the document when I use the following dismax query with qf
> set to field "name_tokenized" only:
>
> http://localhost:44080/solr/studio/editor?deftype=dismax&q=abc_%3Ciframe+src%3D%27loadLocale.js%27+onload%3D%27javascript%3Adocument.XSSed%3D%22name%22%27&debug=true&echoParams=all&qf=name_tokenized^2
>
> If I submit exactly the same query but add another field "feederstate"
> to the qf parameter, I don't get any results anymore. The field is of
> type "string".
>
> http://localhost:44080/solr/studio/editor?deftype=dismax&q=abc_%3Ciframe+src%3D%27loadLocale.js%27+onload%3D%27javascript%3Adocument.XSSed%3D%22name%22%27&debug=true&echoParams=all&qf=name_tokenized^2%20feederstate
>
> The decoded value of q is: abc_<iframe src='loadLocale.js'
> onload='javascript:document.XSSed="name"' and it seems the trailing
> single-quote causes problems here. (In fact, I can find the document
> when I remove the last char)
> The parsed query for the latter case is
>
> (
>   +((
>     DisjunctionMaxQuery((feederstate:abc_<iframe | ((name_tokenized:abc_ name_tokenized:iframe)^2.0))~0.1)
>     DisjunctionMaxQuery((feederstate:src='loadLocale.js' | ((name_tokenized:src name_tokenized:loadlocale.js)^2.0))~0.1)
>     DisjunctionMaxQuery((feederstate:onload='javascript:document.XSSed= | ((name_tokenized:onload name_tokenized:javascript:document.xssed)^2.0))~0.1)
>     DisjunctionMaxQuery((feederstate:name | name_tokenized:name^2.0)~0.1)
>     DisjunctionMaxQuery((feederstate:')~0.1)
>   )~5)
>
>   DisjunctionMaxQuery((textbody:"abc_ iframe src loadlocale.js onload javascript:document.xssed name" | name_tokenized:"abc_ iframe src loadlocale.js onload javascript:document.xssed name"^2.0)~0.1)
> )/no_coord
>
>
> I've configured the handler with <str name="mm">100%</str> so that all
> of the 5 dismax queries at the top must match. But this one does not 
> match:
>
> DisjunctionMaxQuery((feederstate:')~0.1)
>
>
> I'd expect that an additional field in the qf parameter would not lead
> to fewer matches.
> Okay, the above example is a rather crude test but I'd like to
> understand it. Is this a bug in Solr?
>
> I've also found https://issues.apache.org/jira/browse/SOLR-3047 which
> sounds somewhat similar.
>
> Regards,
> Andreas
> .
>

