lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph
Date Mon, 22 Jun 2015 19:21:01 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596490#comment-14596490 ]

Michael McCandless commented on LUCENE-6582:
--------------------------------------------

bq. I really hadn't thought of using position lengths as "references", like this!

It's hard to think about :)  But it "just" means the positions become node IDs, and you must
number the nodes "properly" (so that any token always goes from node X to Y where Y > X).
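
To make the "positions become node IDs" idea concrete, here is a minimal sketch in plain Java with no Lucene dependency. The {{Token}} and {{Arc}} classes and the {{toArcs}} method are hypothetical names invented for this illustration; they only mimic a token's position increment and position length. The sketch assumes the usual convention that the position counter starts at -1, so the first token (increment 1) leaves node 0.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenGraphArcs {
    /** Hypothetical stand-in for a token's term, position increment and position length. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLen;
        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** One arc of the graph: the term leaves node "from" and arrives at node "to". */
    static final class Arc {
        final String term;
        final int from;
        final int to;
        Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
        @Override
        public String toString() {
            return term + ": " + from + " -> " + to;
        }
    }

    /** Reads a token stream as graph arcs: position = from node, position + length = to node. */
    static List<Arc> toArcs(List<Token> tokens) {
        List<Arc> arcs = new ArrayList<>();
        int node = -1; // assumed starting point, so the first increment of 1 lands on node 0
        for (Token t : tokens) {
            if (t.posInc < 0 || t.posLen < 1) {
                throw new IllegalArgumentException("bad increment or length for " + t.term);
            }
            node += t.posInc;
            if (node < 0) {
                throw new IllegalArgumentException("first token needs an increment of at least 1");
            }
            // posLen >= 1 guarantees to > from, i.e. every arc goes forward
            arcs.add(new Arc(t.term, node, node + t.posLen));
        }
        return arcs;
    }

    public static void main(String[] args) {
        // "wtf happened" with the synonym wtf -> "what the fudge"
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("wtf", 1, 3));      // spans the whole three-word synonym
        tokens.add(new Token("what", 0, 1));     // same start node as "wtf"
        tokens.add(new Token("the", 1, 1));
        tokens.add(new Token("fudge", 1, 1));
        tokens.add(new Token("happened", 1, 1));
        for (Arc a : toArcs(tokens)) {
            System.out.println(a);
        }
    }
}
```

In this example "wtf" goes from node 0 to node 3, the same node that "fudge" arrives at, so "happened" (3 -> 4) follows both paths.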

bq. One problem that I see is that I'll need more buffering 

I think that's fine; better correctness trumps the added buffering cost.

bq. One other doubt I have is how this affects the indexer. I imagine it saves position lengths
on the index too, so this shouldn't be a problem, right?

The index does NOT record position length today... I think if we fix syn filter here to produce
the correct graph, we should also insert a "sausagizer" phase that turns this graph back into
a sausage for indexing?  (So that "what the fudge" and "wow that's funny" will in fact match
a document that had "wtf").

However, if you apply syn filter at search time, we could fix query parsers to possibly "do
the right thing" here, e.g. translating this graph into a union of phrase queries, or using
TermAutomatonQuery (in sandbox still), or something ...
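
As a rough sketch of the "union of phrase queries" option, the plain-Java code below enumerates every path through a small token graph and emits each path as one phrase. It is not how any existing query parser works; {{Arc}}, {{phrases}} and {{walk}} are hypothetical names for this illustration. A real implementation would presumably turn each phrase into a phrase query and OR them together, and would need some limit, since the number of paths can grow exponentially with the number of overlapping synonyms.

```java
import java.util.ArrayList;
import java.util.List;

public class GraphToPhrases {
    /** Hypothetical arc: a term going from node "from" to node "to" (to > from). */
    static final class Arc {
        final String term;
        final int from;
        final int to;
        Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    /** Returns one space-joined phrase per path from node "start" to node "end". */
    static List<String> phrases(List<Arc> arcs, int start, int end) {
        List<String> out = new ArrayList<>();
        walk(arcs, start, end, new ArrayList<String>(), out);
        return out;
    }

    /** Depth-first walk; it terminates because every arc moves to a higher node. */
    private static void walk(List<Arc> arcs, int node, int end,
                             List<String> path, List<String> out) {
        if (node == end) {
            out.add(String.join(" ", path));
            return;
        }
        for (Arc a : arcs) {
            if (a.from != node) {
                continue;
            }
            if (a.to <= a.from) {
                throw new IllegalArgumentException("arc must go forward: " + a.term);
            }
            path.add(a.term);
            walk(arcs, a.to, end, path, out);
            path.remove(path.size() - 1); // backtrack before trying the next arc
        }
    }

    public static void main(String[] args) {
        // Graph for the query "wtf happened" with the synonym wtf -> "what the fudge"
        List<Arc> arcs = new ArrayList<>();
        arcs.add(new Arc("wtf", 0, 3));
        arcs.add(new Arc("what", 0, 1));
        arcs.add(new Arc("the", 1, 2));
        arcs.add(new Arc("fudge", 2, 3));
        arcs.add(new Arc("happened", 3, 4));
        for (String phrase : phrases(arcs, 0, 4)) {
            System.out.println(phrase);
        }
    }
}
```

For this graph the walk yields two phrases, "wtf happened" and "what the fudge happened".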

> SynonymFilter should generate a correct (or, at least, better) graph
> --------------------------------------------------------------------
>
>                 Key: LUCENE-6582
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6582
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Ian Ribas
>         Attachments: LUCENE-6582.patch, LUCENE-6582.patch, after.png, after2.png, after3.png,
before.png
>
>
> Some time ago, I had a problem with synonyms and phrase type queries (actually, it was
elasticsearch and I was using a match query with multiple terms and the "and" operator, as
better explained here: https://github.com/elastic/elasticsearch/issues/10394).
> That issue led to some work on Lucene: LUCENE-6400 (where I helped a little with tests)
and  LUCENE-6401. This issue is also related to LUCENE-3843.
> Starting from the discussion on LUCENE-6400, I'm attempting to implement a solution.
Here is a patch with a first step - the implementation to fix "SynFilter to be able to 'make
positions'" (as was mentioned on the [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
In this way, the synonym filter generates a correct (or, at least, better) graph.
> As the synonym matching is greedy, I only had to worry about fixing the position length
of the rules of the current match, no future or past synonyms would "span" over this match
(please correct me if I'm wrong!). It did require more buffering, twice as much.
> The new behavior I added is not active by default; a new parameter has to be passed in
a new constructor for {{SynonymFilter}}. The changes I made do change the token stream generated
by the synonym filter, and I thought it would be better to let that be a voluntary decision
for now.
> I did some refactoring on the code, but mostly on what I had to change for my implementation,
so that the patch was not too hard to read. I created specific unit tests for the new implementation
({{TestMultiWordSynonymFilter}}) that should show how things will be with the new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

