lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Date Tue, 11 May 2010 21:55:41 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866363#action_12866363 ]

Hoss Man commented on LUCENE-2458:
----------------------------------

bq. a Boolean Query formed with the default operator.

That seems like equally bad default behavior -- lots of existing TokenFilters produce chains
of tokens in situations where the user writing the query string clearly intended to search
for a single "word" and has no idea that, as an implementation detail, multiple tokens were
produced under the covers (e.g. WordDelimiterFilter, n-grams, etc.).

I haven't thought this through very well, but perhaps this is an area where (the new) Token
Attributes could be used to tell QueryParser the intent behind a stream of multiple
tokens? A new Attribute could be set on each token to convey whether that token should be combined
with the previous token, and in what way: as a phrase, as a conjunction, or as a disjunction.
(This could still be orthogonal to the position, which would indicate slop/span type information
as it does currently.)
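To make the idea a bit more concrete, here is a rough sketch in plain Java. Everything in it (Join, Tok, combine) is a made-up name for illustration, not an existing Lucene Attribute or QueryParser API, and it renders a query string rather than building real Query objects. It assumes consecutive PHRASE-joined tokens collapse into one quoted phrase and that the remaining groups are joined left to right.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: none of these types exist in Lucene.
class JoinHintSketch {
    // How a token should be combined with the token before it.
    enum Join { PHRASE, AND, OR }

    // A token plus the hint its analysis component attached (null = no hint).
    static final class Tok {
        final String text;
        final Join join;
        Tok(String text, Join join) { this.text = text; this.join = join; }
    }

    // Renders the tokens of one input chunk as a query string.
    // The hint on the first token is ignored, since there is nothing before it.
    static String combine(List<Tok> toks, Join fallback) {
        List<String> parts = new ArrayList<>(); // phrase groups
        List<Join> ops = new ArrayList<>();     // operator placed before parts[i], i >= 1
        for (Tok t : toks) {
            Join j = (t.join != null) ? t.join : fallback;
            if (!parts.isEmpty() && j == Join.PHRASE) {
                // Extend the current phrase group.
                int last = parts.size() - 1;
                parts.set(last, parts.get(last) + " " + t.text);
            } else {
                if (!parts.isEmpty()) ops.add(j);
                parts.add(t.text);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            if (i > 0) sb.append(' ').append(ops.get(i - 1)).append(' ');
            String p = parts.get(i);
            // Multi-token groups are quoted; single tokens are left bare.
            sb.append(p.contains(" ") ? "\"" + p + "\"" : p);
        }
        return sb.toString();
    }
}
```

With this sketch, a filter that splits "wi-fi" and marks the second token PHRASE would yield "wi fi" as a quoted phrase, while one that marks it AND would yield wi AND fi.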

Stock analysis components that produce multiple tokens could be modified to add this attribute
fairly easily (it should be a relatively static value for any component that currently "splits"
tokens), and QueryParser could have an option controlling what to do if it encounters a token
without this attribute (perhaps even two options: one for quoted input chunks and one for unquoted
input chunks).
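A rough sketch of what those two options might look like, in plain Java. The class, field, and method names are invented for illustration and are not part of Lucene's QueryParser. It assumes that defaulting both options to PHRASE reproduces the current term-count behavior.

```java
// Hypothetical sketch only: these names are not part of Lucene's QueryParser.
class FallbackPolicySketch {
    enum Join { PHRASE, AND, OR }

    // Used when a token from a "quoted" input chunk carries no hint.
    Join quotedFallback = Join.PHRASE;
    // Used when a token from an unquoted input chunk carries no hint.
    // PHRASE here is assumed to match today's behavior.
    Join unquotedFallback = Join.PHRASE;

    // A token's own hint wins; otherwise the option matching the chunk's quoting applies.
    Join resolve(Join hint, boolean chunkWasQuoted) {
        if (hint != null) return hint;
        return chunkWasQuoted ? quotedFallback : unquotedFallback;
    }
}
```

Someone with an older or custom analyzer would then only need to change unquotedFallback (for example to OR) and leave the quoted case alone.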

That way the default could still work in a back-compatible way, but people using languages
that don't use whitespace separation *and* older (or custom) analyzers that don't
know about this attribute could set a simple query parser property to force this behavior.

Would that make sense? (asks the man who only vaguely understands Token Attributes at this
point)

> queryparser shouldn't generate phrasequeries based on term count
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2458
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>            Reporter: Robert Muir
>            Priority: Critical
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead the terms are first divided on whitespace, then the analyzer term count is used
> as some sort of "heuristic" to determine if it's a phrase query or not.
> This assumption is a disaster for languages that don't use whitespace separation: CJK,
> compounding European languages like German, Finnish, etc. It also
> makes it difficult for people to use n-gram analysis techniques. In these cases you get
> bad relevance (MAP improves nearly *10x* if you use a PositionFilter at query-time to "turn
> this off" for Chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases it's being
> abused as a heuristic to "second guess" the tokenizer and piece back together things it shouldn't
> have split, but for large collections, doing things like generating phrasequeries because
> StandardTokenizer split a compound on a dash can cause serious performance problems. Instead
> people should analyze their text with the appropriate methods, and QueryParser should only
> generate phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but it's pretty obscure and
> people are not familiar with it. The result is we have bad out-of-box behavior for many languages,
> and bad performance for others on some inputs.
> I propose instead that we change the grammar to actually look for double quotes to determine
> when to generate a phrase query, consistent with the documentation.
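
For reference, the two decision rules contrasted in the issue description can be sketched side by side in plain Java. This is a simplification under stated assumptions: the method names are invented, it is not Lucene's actual QueryParser code, and it ignores position increments and everything else the real parser considers.

```java
// Hypothetical sketch of the two decision rules; not Lucene's actual QueryParser code.
class PhraseDecisionSketch {
    // Behavior described in the issue: a phrase query whenever one
    // whitespace-delimited chunk analyzes to more than one term.
    static boolean phraseByTermCount(int analyzedTermCount) {
        return analyzedTermCount > 1;
    }

    // Proposed behavior: a phrase query only when the user wrote double quotes.
    static boolean phraseBySyntax(boolean inDoubleQuotes) {
        return inDoubleQuotes;
    }
}
```

Under the first rule an unquoted chunk that an n-gram or CJK analyzer splits into several terms becomes a phrase query; under the second it does not unless the user quoted it.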

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

