lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Initial work on multi word synonyms and phrase queries
Date Thu, 18 Jun 2015 10:02:09 GMT
+1 to opening an issue, thanks for exploring this!  It's hairy :)

Your Windows test failures complaining that FSTOrd50 is missing are
curious ... I don't run Windows, but maybe someone who does has an
idea?  That postings format comes from lucene/codecs, which should be
on the class path during tests...

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jun 17, 2015 at 10:21 PM, Robert Muir <rcmuir@gmail.com> wrote:
> Hey, thanks for tackling this! That synonymfilter is a beast...
>
> Can you open a JIRA issue with your patch?
>
> To me the interesting part is this change in the test:
>
>           if (posInc > 0) {
>             // This token increments position, so it is starting a new position.
>             // Its position is the last position plus the posLength of the
>             // last token that started a position.
>             pos += lastPosLength;
>             lastPosLength = posLength;
>           }
>
> This currently implies some change to how posInc/posLen are treated on
> the consumer side: it would need changes to queryparsers and
> indexwriter to work (which is fine, we could figure out those
> semantics). But it's my understanding this logic might be based on some
> properties specific to synonymfilter being greedy, and not really
> general to all streams. So maybe synonymfilter or some other filter
> needs to do this adjustment internally instead.
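
[Illustration, not part of the original message.] The consumer-side change
discussed above can be sketched in plain Java without any Lucene classes. This
is a minimal sketch under stated assumptions: the Token class, the sample
stream, and the starting values (pos = -1, lastPosLength = 1) are hypothetical
and are not taken from the patch. It contrasts the usual rule, where a token's
position is the previous position plus its posInc, with the rule in the quoted
test change, where a token that starts a new position advances by the posLength
of the last token that started a position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PositionSketch {

    /** Hypothetical stand-in for a token's term, position increment and position length. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLength;

        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    /** Usual rule: position = previous position + posInc. */
    static List<Integer> standardPositions(List<Token> tokens) {
        List<Integer> positions = new ArrayList<>();
        int pos = -1; // assumed start value, so the first token lands on 0
        for (Token t : tokens) {
            pos += t.posInc;
            positions.add(pos);
        }
        return positions;
    }

    /** Rule from the quoted test change: advance by the last position-starting token's posLength. */
    static List<Integer> adjustedPositions(List<Token> tokens) {
        List<Integer> positions = new ArrayList<>();
        int pos = -1;          // assumed start value
        int lastPosLength = 1; // assumed start value, so the first token lands on 0
        for (Token t : tokens) {
            if (t.posInc > 0) {
                // This token starts a new position.
                pos += lastPosLength;
                lastPosLength = t.posLength;
            }
            positions.add(pos);
        }
        return positions;
    }

    /** Hypothetical stream: "dns" spans 3 positions, "nameserver" is stacked on it, "fails" follows. */
    static List<Token> sample() {
        return Arrays.asList(
            new Token("dns", 1, 3),
            new Token("nameserver", 0, 1),
            new Token("fails", 1, 1));
    }

    public static void main(String[] args) {
        System.out.println(standardPositions(sample())); // [0, 0, 1]
        System.out.println(adjustedPositions(sample())); // [0, 0, 3]
    }
}
```

Under the adjusted rule only the posLength of the token that started a position
is consulted, which is why the result depends on the order in which a filter
emits stacked tokens.
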
>
> Anyway, I think we should make an issue and investigate it.
>
> On Wed, Jun 17, 2015 at 9:56 PM, Ian <ianribas@hotmail.com> wrote:
>> Hello,
>>
>> Some time ago, I had a problem with synonyms and phrase type queries
>> (actually, it was elasticsearch and I was using a match query with multiple
>> terms and the "and" operator, as better explained here:
>> https://github.com/elastic/elasticsearch/issues/10394).
>>
>> That issue led to some work on Lucene:
>> https://issues.apache.org/jira/browse/LUCENE-6400 (where I helped a little
>> with tests) and  https://issues.apache.org/jira/browse/LUCENE-6401. This
>> issue is also related to https://issues.apache.org/jira/browse/LUCENE-3843.
>>
>> Starting from the discussion on LUCENE-6400, I'm attempting to implement a
>> solution. Here is a patch with a first step - the implementation to fix
>> "SynFilter to be able to 'make positions'" (as was mentioned on the issue).
>> In this way, the synonym filter generates a correct (or, at least, better)
>> graph.
>>
>> As the synonym matching is greedy, I only had to worry about fixing the
>> position length of the rules of the current match; no future or past
>> synonyms would "span" over this match (please correct me if I'm wrong!). It
>> did require more buffering: twice as much.
>>
>> The new behavior I added is not active by default; a new parameter has to be
>> passed in a new constructor for SynonymFilter. The changes I made do change
>> the token stream generated by the synonym filter, and I thought it would be
>> better to let that be a voluntary decision for now.
>>
>> I did some refactoring on the code, but mostly on what I had to change for
>> my implementation, so that the patch was not too hard to read. I created
>> specific unit tests for the new implementation (TestMultiWordSynonymFilter)
>> that should show how things will be with the new behavior.
>>
>> Speaking of tests, I ran "analysis-common" tests locally (Windows 8, Java
>> 8), and had only 2 unrelated failures (as far as I can tell) complaining of
>> missing PostingsFormat "FSTOrd50".
>>
>> Thanks for any help, comment, adjustment on the patch. I'll do my best to
>> make the necessary adjustments.
>>
>> Please forgive me if I did not follow any rule, of the code or of the list,
>> and I would be grateful to be able to learn from my mistakes.
>>
>> Regards,
>> Ian
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>
