lucene-dev mailing list archives

From "Mark Miller (JIRA)" <>
Subject [jira] Commented: (LUCENE-2458) queryparser makes all CJK queries phrase queries regardless of analyzer
Date Wed, 26 May 2010 21:39:36 GMT


Mark Miller commented on LUCENE-2458:

How about making the setting ("if analyzer returns more than 1 token for a
single chunk of whitespace-separated text, make a PhraseQuery")
configurable (instead of hardwired according to Version)? And defaulting it
to off for Version >= 31 (so CJK, etc., work out of the box)?
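To make the proposed switch concrete, here is a minimal, self-contained Java sketch of the decision for one whitespace-separated chunk. It is only an illustration: the class `PhraseDecisionSketch`, the method `buildQuery`, and the flag `autoGeneratePhrase` are hypothetical names, queries are modelled as plain strings, and none of this is the actual QueryParser code or the patch.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the configurable behavior; not Lucene's QueryParser.
class PhraseDecisionSketch {

    // Builds a textual query for a single whitespace-separated chunk that the
    // analyzer has already split into tokens.
    static String buildQuery(String field, List<String> tokens, boolean autoGeneratePhrase) {
        if (tokens.isEmpty()) {
            return "";
        }
        if (tokens.size() == 1) {
            // One token: a plain term query either way.
            return field + ":" + tokens.get(0);
        }
        if (autoGeneratePhrase) {
            // Old hardwired behavior: more than one token becomes a phrase.
            return field + ":\"" + String.join(" ", tokens) + "\"";
        }
        // Setting off: each token stays an independent term.
        return tokens.stream()
                .map(t -> field + ":" + t)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("ab", "bc", "cd");
        System.out.println(buildQuery("f", tokens, true));
        System.out.println(buildQuery("f", tokens, false));
    }
}
```

With the flag on, the three tokens collapse into one phrase; with it off, they remain separate terms that the parser's default operator would combine.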

I think it's pretty clear this would make most people happy.

Personally, I'm somewhat on board with Robert that this may really hamstring us when it comes
to further fixes that are needed/wanted in the future.

To note though - I think in general, most who have commented on this issue are into making
CJK work out of the box. But I really think we need to nail down more consensus on this first.

At a minimum, I think making the behavior configurable, while defaulting to CJK 'betterness',
has pretty much everyone on board.

But I'd really like to discuss whether doing that will only lead to losing that option as
we do things like stop qp from splitting on whitespace in the future...

Something I was thinking, and it might be more of a maintenance headache than it's worth: we
could demote this queryparser from the core query parser, rename it something like
ClassicQueryParser (or whatever), and make a new QueryParser that is better for more languages
across the board (originally basing it on the classic parser, e.g. this patch to start). People
who like the older, more English-biased QueryParser can still use it, and new users will
likely pick up the default QueryParser, which works better with more languages out of the box.

Just an idea.

In any event - I think this patch is a step forward too - but it looks to me like there are
still open concerns and objections.

> queryparser makes all CJK queries phrase queries regardless of analyzer
> -----------------------------------------------------------------------
>                 Key: LUCENE-2458
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>         Attachments: LUCENE-2458.patch, LUCENE-2458.patch, LUCENE-2458.patch
> The queryparser automatically makes *ALL* CJK, Thai, Lao, Myanmar, Tibetan, ... queries
> into phrase queries, even though you didn't ask for one, and there isn't a way to turn this
> off.
> This completely breaks Lucene for these languages, as it treats all queries like 'grep'.
> Example: if you query for f:abcd with standardanalyzer, where a,b,c,d are Chinese characters,
> you get a phrasequery of "a b c d". If you use cjk analyzer, it's no better: it's a phrasequery
> of "ab bc cd", and if you use smartchinese analyzer, you get a phrasequery like "ab cd".
> But the user didn't ask for one, and they cannot turn it off.
> The reason is that the code to form phrase queries is not internationally appropriate
> and assumes whitespace tokenization. If more than one token comes out of whitespace-delimited
> text, it's automatically a phrase query no matter what.
> The proposed patch fixes the core queryparser (with all backwards compat kept) to only
> form phrase queries when the double quote operator is used.
> Implementing subclasses can always extend the QP and auto-generate whatever kind of queries
> they want that might completely break search for languages they don't care about, but core
> general-purpose QPs should be language independent.
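
The example in the description can be reproduced with a small Java sketch of why a single chunk with no whitespace yields several tokens. The class `CjkTokenSketch` and its `unigrams` and `bigrams` methods are hypothetical, simplified stand-ins for per-character and overlapping-bigram tokenization; they are not the real standardanalyzer or cjk analyzer implementations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for per-character and bigram tokenization of one chunk.
class CjkTokenSketch {

    // One token per code point, e.g. "abcd" -> [a, b, c, d].
    static List<String> unigrams(String chunk) {
        int[] cps = chunk.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < cps.length; i++) {
            out.add(new String(cps, i, 1));
        }
        return out;
    }

    // Overlapping pairs of code points, e.g. "abcd" -> [ab, bc, cd].
    static List<String> bigrams(String chunk) {
        int[] cps = chunk.codePoints().toArray();
        List<String> out = new ArrayList<>();
        if (cps.length == 1) {
            out.add(chunk); // a lone character is kept as a single token
            return out;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // "abcd" stands in for four Chinese characters, as in the description.
        System.out.println(unigrams("abcd"));
        System.out.println(bigrams("abcd"));
    }
}
```

Either way, one whitespace-delimited chunk produces more than one token, which is exactly the condition that triggers the automatic phrase query described above.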
