lucene-dev mailing list archives

From "Steve Rowe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-2605) queryparser parses on whitespace
Date Fri, 01 Jul 2016 01:40:11 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated LUCENE-2605:
-------------------------------
    Attachment: LUCENE-2605.patch

Okay, really final patch.  On SOLR-9185 I was having trouble integrating the Solr standard
QP's comment support with the whitespace tokenization I introduced here, so I tried switching
the Solr parser back to ignoring both whitespace and comments, and it worked.  This patch brings
that grammar simplification back here too: the rules mention whitespace far less often, and
fewer (and less complicated) lookaheads are required.

I've included the generated files in the patch.

No tests changed from the last patch.

All Lucene tests pass, and precommit passes.

> queryparser parses on whitespace
> --------------------------------
>
>                 Key: LUCENE-2605
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2605
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser
>            Reporter: Robert Muir
>            Assignee: Steve Rowe
>         Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch
>
>
> The queryparser parses input on whitespace, and sends each whitespace-separated term
> to its own independent token stream.
> This breaks the following at query time, because they can't see across whitespace boundaries:
> * n-gram analysis
> * shingles
> * synonyms (especially multi-word for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected, as users think their charfilters/tokenizers/tokenfilters
> will do the same thing at index time and query time, but in many cases they can't.
> Instead, preferably the queryparser would parse around only real 'operators'.
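
The problem described above can be illustrated with a minimal sketch. It is plain
Python, not Lucene code: the toy analyzer, the shingle size of 2, and every function
name below are assumptions made for illustration only. It shows how analyzing each
whitespace-separated chunk on its own hides the shingles that the same analyzer
produces when it sees the whole text.

```python
def shingles(tokens, size=2):
    """Return the tokens plus every run of `size` adjacent tokens joined by a space."""
    out = list(tokens)
    out += [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
    return out


def analyze(text):
    """Toy analyzer (hypothetical): lowercase, split on whitespace, add 2-word shingles."""
    return shingles(text.lower().split())


def parse_split_on_whitespace(query):
    """Behaviour described in the issue: each chunk gets its own independent analysis."""
    terms = []
    for chunk in query.split():
        terms.extend(analyze(chunk))  # the analyzer only ever sees one word
    return terms


def parse_whole_text(query):
    """Desired behaviour: the analyzer sees the text across whitespace boundaries."""
    return analyze(query)


query = "wi fi router"
per_chunk = parse_split_on_whitespace(query)
whole = parse_whole_text(query)
print(per_chunk)
print(whole)
```

In this sketch the per-chunk path can never emit "wi fi", because no single call to
the analyzer receives both words. The same reasoning applies to multi-word synonyms
and to any other filter that needs to look at adjacent tokens.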





