lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jim Ferenczi (JIRA)" <>
Subject [jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder
Date Tue, 17 Jan 2017 14:00:29 GMT


Jim Ferenczi commented on LUCENE-7638:

Thanks for taking a look @mattweber and [~mikemccand]

Can this handle "side paths on side paths" graph structure (I think you called this "nested
multi-term synonyms")? While no analysis chain can naturally produce this today (that I know
of!), the TokenStream attributes can easily express it. And you could imagine it happening
in the future, e.g. if you use Kuromoji tokenizer or WordDelimiterGraphFilter followed by
a SynonymGraphFilter (except we'd need to e.g. use the synonym graph filter from LUCENE-5012,
which can correctly consume a graph). If this is expected to work maybe we should add a test
case showing that?

I'll add a test case because it's expected to work ;) . This is also the reason why this patch
does not produce {code}PhraseQuery{code} for synonyms.   For simple "side paths" this is easy
to do but we would need to switch to Span queries for "side paths on side paths" so I though
that it could be done in another issue. 

It seems like you don't need to be using Term here, except at the end to pass to the newXXXQuery,
since everything is in a single field here, and we are hoping to move away from Term entirely

Thanks, I'll simplify the patch.

Holes are challenging for graph token streams ... can you add a test case that encounters
holes, e.g. simulated StopFilter? There are at least two "fun" cases: a hole that cuts the
graph entirely into two partitions, and a synonym spanning over a hole ... CannedTokenStream
is useful for feeding such "interesting" cases.

I though that holes would not be a problem for boolean queries but now I am not sure. I'll
test that.

The seems to be used only for tie-breaking on compare, not for lookup in the TreeSet
as the comment suggests?

The comment is misleading. It is needed because I use {code}TreeSet#remove{code} which uses
compare to check object equality. So the {code}{code} is the unique identifier for
the path.

> Optimize graph query produced by QueryBuilder
> ---------------------------------------------
>                 Key: LUCENE-7638
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>         Attachments: LUCENE-7638.patch
> The QueryBuilder creates a graph query when the underlying TokenStream contains token
with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where synonyms can
span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph TokenStream.
For instance if you have a synonym like "ny, new york" and you search for "new york city",
the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient to build
a boolean query that merges all the intersections in the graph. So instead of "new york city",
"ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each intersection in the
graph. This is not done in this patch but we could also create a specialized query that gives
equivalent scores to multi terms synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could also use
the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple muli-terms
synonyms are present in a single request. Could you take a look ?

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message