lucene-dev mailing list archives

From "Jim Ferenczi (JIRA)" <>
Subject [jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder
Date Mon, 16 Jan 2017 17:31:26 GMT


Jim Ferenczi commented on LUCENE-7638:

For a pure phrase query it's a good fit because it's a proximity query, but for boolean queries
the problem is different. We cannot build the TermAutomatonQuery directly; first we need to
find the start and end state of each multi-term synonym in the graph. That's what the attached
patch does lazily: for each intersection point it creates a multi-term synonym query.
Currently the multi-term synonym query is a boolean query, but we could change the logic and
use the TermAutomatonQuery instead, or even create a PhraseQuery for each path in the multi-term
synonym. This patch also handles nested multi-term synonyms, which makes the detection of intersection
points harder.
The bottom line is that if we are able to extract the multi-term synonyms of the graph, then we
can choose more easily how we want to search and score these inner graphs. Does this make
sense?
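
To make "intersection point" concrete, here is a minimal sketch in plain Java. It is not the code from the attached patch, and it does not use any Lucene classes: the `Token` class is a hypothetical stand-in for a term plus its position and PositionLengthAttribute, and it assumes positions are consecutive integers. A position counts as an intersection point when no token spans across it, so every path through the graph has to pass through it. Nested multi-term synonyms are not handled here.

```java
import java.util.ArrayList;
import java.util.List;

final class IntersectionPointsSketch {

    /** Hypothetical token: a term, its start position and its position length. */
    static final class Token {
        final String term;
        final int start;
        final int length;

        Token(String term, int start, int length) {
            this.term = term;
            this.start = start;
            this.length = length;
        }

        int end() {
            return start + length;
        }
    }

    /** Returns the positions that no token crosses, in increasing order. */
    static List<Integer> intersectionPoints(List<Token> tokens) {
        int last = 0;
        for (Token t : tokens) {
            last = Math.max(last, t.end());
        }
        List<Integer> points = new ArrayList<>();
        for (int p = 0; p <= last; p++) {
            boolean crossed = false;
            for (Token t : tokens) {
                // A token crosses p when it starts before p and ends after p.
                if (t.start < p && p < t.end()) {
                    crossed = true;
                    break;
                }
            }
            if (!crossed) {
                points.add(p);
            }
        }
        return points;
    }

    public static void main(String[] args) {
        // "new york city" with the synonym "ny, new york":
        // "ny" starts at position 0 and has a position length of 2.
        List<Token> tokens = List.of(
            new Token("ny", 0, 2),
            new Token("new", 0, 1),
            new Token("york", 1, 1),
            new Token("city", 2, 1));
        System.out.println(intersectionPoints(tokens)); // prints [0, 2, 3]
    }
}
```

In this example the span between positions 0 and 2 holds the multi-term synonym ("ny" or "new york"), and the span between 2 and 3 holds the single term "city".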

> Optimize graph query produced by QueryBuilder
> ---------------------------------------------
>                 Key: LUCENE-7638
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>         Attachments: LUCENE-7638.patch
> The QueryBuilder creates a graph query when the underlying TokenStream contains tokens
with a PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattices, to be more precise) where synonyms can
span multiple terms.
> Currently the graph query is built by visiting all the paths of the graph TokenStream.
For instance, if you have a synonym like "ny, new york" and you search for "new york city",
the query builder would produce two paths:
> "new york city", "ny city"
> This can quickly explode when the number of multi-term synonyms increases.
> The query "ny ny" for instance would produce 4 paths, and so on.
> For boolean queries with should or must clauses it should be more efficient to build
a boolean query that merges all the intersections in the graph. So instead of "new york city",
"ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all-paths solution.
> The patch transforms multi-term synonyms into a graph query for each intersection in the
graph. This is not done in this patch, but we could also create a specialized query that gives
equivalent scores to multi-term synonyms, like the SynonymQuery does for single-term synonyms.
> For phrase queries this patch does not change the current behavior, but we could also use
the new method to create an optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple multi-term
synonyms are present in a single request. Could you take a look?
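
The difference between the two strategies in the description above can be sketched in plain Java. This is only an illustration under stated assumptions, not the patch and not the QueryBuilder API: the class `GraphQuerySketch` and its methods are hypothetical, the query is modelled as a list of segments where each segment is a list of alternative term sequences, and the merged query is rendered as a string in the same notation as the description.

```java
import java.util.ArrayList;
import java.util.List;

final class GraphQuerySketch {

    /** "All paths" strategy: one phrase per path, so the count is a product. */
    static List<String> allPaths(List<List<List<String>>> segments) {
        List<String> paths = new ArrayList<>();
        paths.add("");
        for (List<List<String>> alternatives : segments) {
            List<String> next = new ArrayList<>();
            for (String prefix : paths) {
                for (List<String> alt : alternatives) {
                    String suffix = String.join(" ", alt);
                    next.add(prefix.isEmpty() ? suffix : prefix + " " + suffix);
                }
            }
            paths = next;
        }
        return paths;
    }

    /** Merged strategy: one must clause per segment, alternatives inside it. */
    static String merged(List<List<List<String>>> segments) {
        List<String> clauses = new ArrayList<>();
        for (List<List<String>> alternatives : segments) {
            List<String> rendered = new ArrayList<>();
            for (List<String> alt : alternatives) {
                // A multi-term alternative requires all of its terms.
                rendered.add(alt.size() == 1
                    ? alt.get(0)
                    : "(+" + String.join(" +", alt) + ")");
            }
            clauses.add(rendered.size() == 1
                ? "+" + rendered.get(0)
                : "+(" + String.join(" ", rendered) + ")");
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        List<List<String>> ny = List.of(List.of("new", "york"), List.of("ny"));
        List<List<String>> city = List.of(List.of("city"));

        System.out.println(allPaths(List.of(ny, city)));      // prints [new york city, ny city]
        System.out.println(merged(List.of(ny, city)));        // prints +((+new +york) ny) +city
        System.out.println(allPaths(List.of(ny, ny)).size()); // prints 4
    }
}
```

The number of paths is the product of the number of alternatives in each segment, while the merged form grows with their sum.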
