lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Elschot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
Date Sun, 15 Jun 2014 10:18:01 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031826#comment-14031826
] 

Paul Elschot commented on LUCENE-5205:
--------------------------------------

bq.  I'm wondering if I should add FieldMaskingSpanQueries to the SpanOnlyParser

When two fields are indexed to allow a FieldMaskingSpanQuery (see  LUCENE-1494 for an example)
such an addition makes sense:

field1:[ v1 field2:v2]

Here the masking should be from field2 to field1.

There is a scoring issue for FieldMaskingSpanQuery, LUCENE-3723. So far I have avoided scoring
in the label module...

For querying labeled fragments, a FieldMaskingSpanQuery should be used between two fragment
fields that share their labeled positions, or when each fragment in one field consist of a
single token. The first case happens in the label module for xml attribute names and attribute
values. For the single token fragments case there is no special provision in the label module.




> [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
> -----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5205
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/queryparser
>            Reporter: Tim Allison
>              Labels: patch
>             Fix For: 4.9
>
>         Attachments: LUCENE-5205-cleanup-tests.patch, LUCENE-5205-date-pkg-prvt.patch,
LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, LUCENE-5205_dateTestReInitPkgPrvt.patch, LUCENE-5205_improve_stop_word_handling.patch,
LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, SpanQueryParser_v1.patch.gz, patch.txt
>
>
> This parser extends QueryParserBase and includes functionality from:
> * Classic QueryParser: most of its syntax
> * SurroundQueryParser: recursive parsing for "near" and "not" clauses.
> * ComplexPhraseQueryParser: can handle "near" queries that include multiterms (wildcard,
fuzzy, regex, prefix),
> * AnalyzingQueryParser: has an option to analyze multiterms.
> At a high level, there's a first pass BooleanQuery/field parser and then a span query
parser handles all terminal nodes and phrases.
> Same as classic syntax:
> * term: test 
> * fuzzy: roam~0.8, roam~2
> * wildcard: te?t, test*, t*st
> * regex: /\[mb\]oat/
> * phrase: "jakarta apache"
> * phrase with slop: "jakarta apache"~3
> * default "or" clause: jakarta apache
> * grouping "or" clause: (jakarta apache)
> * boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
> * multiple fields: title:lucene author:hatcher
>  
> Main additions in SpanQueryParser syntax vs. classic syntax:
> * Can require "in order" for phrases with slop with the \~> operator: "jakarta apache"\~>3
> * Can specify "not near": "fever bieber"!\~3,10 ::
>     find "fever" but not if "bieber" appears within 3 words before or 10 words after
it.
> * Fully recursive phrasal queries with \[ and \]; as in: \[\[jakarta apache\]~3 lucene\]\~>4
:: 
>     find "jakarta" within 3 words of "apache", and that hit has to be within four words
before "lucene"
> * Can also use \[\] for single level phrasal queries instead of " as in: \[jakarta apache\]
> * Can use "or grouping" clauses in phrasal queries: "apache (lucene solr)"\~3 :: find
"apache" and then either "lucene" or "solr" within three words.
> * Can use multiterms in phrasal queries: "jakarta\~1 ap*che"\~2
> * Did I mention full recursion: \[\[jakarta\~1 ap*che\]\~2 (solr~ /l\[ou\]\+\[cs\]\[en\]\+/)]\~10
:: Find something like "jakarta" within two words of "ap*che" and that hit has to be within
ten words of something like "solr" or that "lucene" regex.
> * Can require at least x number of hits at boolean level: "apache AND (lucene solr tika)~2
> * Can use negative only query: -jakarta :: Find all docs that don't contain "jakarta"
> * Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of potential
performance issues!).
> Trivial additions:
> * Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance =1, prefix =2)
> * Can specifiy Optimal String Alignment (OSA) vs Levenshtein for distance <=2: (jakarta~1
(OSA) vs jakarta~>1(Levenshtein)
> This parser can be very useful for concordance tasks (see also LUCENE-5317 and LUCENE-5318)
and for analytical search.  
> Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery.
> Most of the documentation is in the javadoc for SpanQueryParser.
> Any and all feedback is welcome.  Thank you.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message