lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word
Date Sun, 16 May 2010 13:20:43 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2465.
---------------------------------

    Resolution: Won't Fix

This isn't a bug. As mentioned, *you* need to use the correct Unicode character; it does not
matter whether it's on your users' keyboards or not.

It's your responsibility to disambiguate (with whatever logic you want) that U+0022 should
really be U+05F4. Then it will work correctly with Lucene (including StandardTokenizer).
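
As a rough illustration of that disambiguation step, here is a minimal Java sketch that
rewrites U+0022 to U+05F4 (HEBREW PUNCTUATION GERSHAYIM) only when it sits directly between
two Hebrew letters, before the string is handed to the QueryParser. The class name
GershayimNormalizer, the method normalize, and the "between two Hebrew letters" heuristic
are assumptions made for this example and are not part of Lucene; any other logic that
suits your users would do.

```java
final class GershayimNormalizer {
    // U+0022, the ASCII double quote that users actually type.
    private static final char QUOTATION_MARK = '"';
    // U+05F4, the character Hebrew acronyms should carry.
    private static final char HEBREW_GERSHAYIM = '\u05F4';

    // Assumption: only the base Hebrew letters alef..tav (U+05D0..U+05EA) count.
    private static boolean isHebrewLetter(char c) {
        return c >= '\u05D0' && c <= '\u05EA';
    }

    // Replaces a double quote with gershayim when both neighbours are Hebrew letters.
    // Quotes at the start or end of the string, or next to whitespace, are left alone,
    // so ordinary phrase quoting keeps working.
    static String normalize(String query) {
        StringBuilder out = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            boolean midWord = c == QUOTATION_MARK
                    && i > 0
                    && i + 1 < query.length()
                    && isHebrewLetter(query.charAt(i - 1))
                    && isHebrewLetter(query.charAt(i + 1));
            out.append(midWord ? HEBREW_GERSHAYIM : c);
        }
        return out.toString();
    }
}
```

The normalized string would then be passed to the parser in place of the raw user input.
This heuristic is deliberately narrow: it leaves Latin input such as Foo"bar test"
untouched, and it can still guess wrong on unusual Hebrew input, so treat it as a starting
point only.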


> QueryParser should ignore double-quotes if mid-word
> ---------------------------------------------------
>
>                 Key: LUCENE-2465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2465
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2,
>                      2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, 4.0
>            Reporter: Itamar Syn-Hershko
>
> The current implementation of Lucene's QueryParser identifies a phrase in the query when
> it hits a double-quote character, even mid-word. For example, the string ' Foo"bar test" '
> produces a BooleanQuery holding one term and one PhraseQuery ("bar test"). This behavior
> is somewhat flawed: a phrase is a group of words surrounded by double quotes, as defined by
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but nowhere does that page say
> that double quotes also tokenize the input. Arguably, a phrase should only be identified
> as such when it is also surrounded by whitespace.
> Besides being logically incorrect, this behavior makes parsing Hebrew acronyms impossible.
> Hebrew acronyms contain one double-quote character in the middle of a word (for example,
> MNK"L), which causes the QP to throw a syntax exception: it expects a second double quote
> to close a phrase query, and so essentially splits the acronym in two.
> The solution is fairly simple: change the JavaCC grammar to check whether whitespace
> precedes the double quote when a phrase opening is expected, or peek to see whether
> whitespace follows the double quote when a phrase closing is expected.
> This would both eliminate a logically incorrect behavior that shouldn't be relied on
> anyway and allow Hebrew queries containing acronyms to be parsed correctly.
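
For readers following the quoted proposal, the whitespace rule it describes can be sketched
in plain Java, outside of JavaCC. This is only an illustration of the rule as worded in the
report, not code from Lucene's QueryParser grammar; the class PhraseQuoteRule and both
method names are hypothetical, and treating the start and end of the string like whitespace
is an added assumption.

```java
final class PhraseQuoteRule {
    // A quote may open a phrase only at the start of the input or right after whitespace.
    static boolean opensPhrase(String s, int i) {
        return s.charAt(i) == '"'
                && (i == 0 || Character.isWhitespace(s.charAt(i - 1)));
    }

    // A quote may close a phrase only at the end of the input or right before whitespace.
    static boolean closesPhrase(String s, int i) {
        return s.charAt(i) == '"'
                && (i == s.length() - 1 || Character.isWhitespace(s.charAt(i + 1)));
    }
}
```

Under this rule the quote in MNK"L neither opens nor closes a phrase, and the first quote
in Foo"bar test" does not open one.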

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
