lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word
Date Sun, 16 May 2010 15:16:42 GMT


Robert Muir commented on LUCENE-2465:

> But instead of looking for whitespace, a quote would only be considered an opening quote if
> it wasn't preceded by a letter or number or backslash.

Great, now you break phrases for complex scripts, because "letter" or "number" doesn't apply
there (e.g. Hindi/Thai have non-spacing vowels).

This is why I say the only solution is to follow Unicode. Adding hacks like this will only
break other languages.
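
A minimal sketch of the objection, assuming the proposed rule is implemented with Java's built-in character classes. The helper `opensPhrase` and the class name are hypothetical and are not part of Lucene's QueryParser. The Thai vowel sign SARA I (U+0E34) has Unicode category Mn (non-spacing mark), so `Character.isLetterOrDigit` reports false for it, and a quote that follows it mid-word is still taken as a phrase opener:

```java
class QuotePrecedingCheck {
    // Hypothetical helper mirroring the proposed rule: a double quote opens a
    // phrase only if it is NOT preceded by a letter, a digit, or a backslash.
    static boolean opensPhrase(String query, int quoteIndex) {
        if (quoteIndex == 0) {
            return true; // nothing precedes the quote
        }
        int prev = query.codePointBefore(quoteIndex);
        return !(Character.isLetterOrDigit(prev) || prev == '\\');
    }

    public static void main(String[] args) {
        // Latin: the quote is preceded by the letter 'o', so it is not an opener.
        String latin = "Foo\"bar";
        System.out.println(opensPhrase(latin, latin.indexOf('"'))); // false

        // Thai: DO DEK (U+0E14) + SARA I (U+0E34), then a mid-word quote.
        String thai = "\u0E14\u0E34\"x";
        int prev = thai.codePointBefore(thai.indexOf('"'));
        System.out.println(Character.getType(prev) == Character.NON_SPACING_MARK); // true
        System.out.println(Character.isLetterOrDigit(prev));                       // false
        // The rule therefore treats this mid-word quote as a phrase opener.
        System.out.println(opensPhrase(thai, thai.indexOf('"')));                  // true
    }
}
```

The same check applied to a word ending in a base letter gives the opposite answer, so the behaviour would depend on which character a word happens to end with.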

> QueryParser should ignore double-quotes if mid-word
> ---------------------------------------------------
>                 Key: LUCENE-2465
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2,
2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, 4.0
>            Reporter: Itamar Syn-Hershko
> Current implementation of Lucene's QueryParser identifies a phrase in the query when
hitting a double-quotes char, even if it is mid-word. For example, the string ' Foo"bar test"
' will produce a BooleanQuery, holding one term and one PhraseQuery ("bar test"). This behavior
is somewhat flawed; a Phrase is a group of words surrounded by double quotes as defined by, but nowhere does it say double-quotes
will also tokenize the input. Arguably, a phrase should only be identified as such when it
is also surrounded by whitespaces.
> Other than a logically incorrect behavior, this makes parsing of Hebrew acronyms impossible.
Hebrew acronyms contain one double-quotes char in the middle of a word (for example, MNK"L),
hence causing the QP to throw a syntax exception, since it is expecting another double-quotes
to create a phrase query, essentially splitting the acronym into two.
> The solution to this is pretty simple - changing the JavaCC syntax to check if a whitespace
precedes the double-quote when a phrase opening is expected, or peek to see if a whitespace
follows the double-quotes if a phrase closing is expected.
> This will both eliminate a logically incorrect behavior which shouldn't be relied on
anyway, and allow Hebrew queries to be correctly parsed also when containing acronyms.
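
The whitespace check proposed in the issue description can be sketched in plain Java, outside JavaCC. This is only an illustration under stated assumptions: `isPhraseOpen`, `isPhraseClose`, and the class name are hypothetical, and the sketch ignores the rest of the query syntax (for example a field prefix such as `title:"..."`, where the quote is not preceded by whitespace).

```java
class WhitespaceQuoteRule {
    // A quote opens a phrase only at the start of input or after whitespace.
    static boolean isPhraseOpen(String q, int i) {
        return q.charAt(i) == '"'
                && (i == 0 || Character.isWhitespace(q.charAt(i - 1)));
    }

    // A quote closes a phrase only at the end of input or before whitespace.
    static boolean isPhraseClose(String q, int i) {
        return q.charAt(i) == '"'
                && (i == q.length() - 1 || Character.isWhitespace(q.charAt(i + 1)));
    }

    public static void main(String[] args) {
        // Acronym from the issue: the quote is mid-word, so it is neither.
        String acronym = "MNK\"L";
        int mid = acronym.indexOf('"');
        System.out.println(isPhraseOpen(acronym, mid));  // false
        System.out.println(isPhraseClose(acronym, mid)); // false

        // An ordinary phrase: both delimiters are recognised.
        String phrase = "\"bar test\"";
        System.out.println(isPhraseOpen(phrase, 0));                    // true
        System.out.println(isPhraseClose(phrase, phrase.length() - 1)); // true
    }
}
```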
