lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: enhancement for SynonymFilter
Date Fri, 18 Nov 2016 14:02:18 GMT
Hmm I didn't realize there was that change in behavior between versions.

But, in 6.3.0, can't you look for a token of type SYNONYM whose
posInc=0 and then know that the previous (posInc>0) token had caused
that synonym?  You just need a bit of caching, until all synonyms for
a given token have been seen (or, maybe just one).  No need to cache
the whole stream of tokens ...
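
A minimal sketch of that one-token lookahead, in plain Java so it runs without Lucene on the classpath. The Tok class, its hasSynonyms flag, and SynonymLookahead.mark are hypothetical names that stand in for the term, type, and position-increment attributes; none of them is Lucene API. It assumes a synonym shows up as a token of type SYNONYM with posInc=0 directly after the token that caused it, as in the 6.3.0 output quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for one token's attributes; not a Lucene class.
final class Tok {
    final String term;
    final String type;
    final int posInc;
    boolean hasSynonyms; // hypothetical flag filled in by the sketch below

    Tok(String term, String type, int posInc) {
        this.term = term;
        this.type = type;
        this.posInc = posInc;
    }
}

final class SynonymLookahead {
    // Holds back exactly one token. When the next token arrives, the held
    // token is flagged if it advanced the position (posInc > 0) and the
    // next token is a SYNONYM stacked at the same position (posInc == 0).
    static List<Tok> mark(Iterator<Tok> in) {
        List<Tok> out = new ArrayList<>();
        Tok held = null;
        while (in.hasNext()) {
            Tok next = in.next();
            if (held != null) {
                held.hasSynonyms = held.posInc > 0
                        && "SYNONYM".equals(next.type)
                        && next.posInc == 0;
                out.add(held);
            }
            held = next;
        }
        if (held != null) {
            out.add(held); // last token has no successor, so it stays unflagged
        }
        return out;
    }

    public static void main(String[] args) {
        // First three tokens mirror the 6.3.0 example; "qux" is made up.
        List<Tok> tokens = Arrays.asList(
                new Tok("foo", "shingle", 1),
                new Tok("foo bar", "SYNONYM", 0),
                new Tok("baz", "SYNONYM", 0),
                new Tok("qux", "shingle", 1));
        for (Tok t : mark(tokens.iterator())) {
            System.out.println(t.term + " hasSynonyms=" + t.hasSynonyms);
        }
    }
}
```

In a real TokenFilter the same idea would mean capturing the state of one token and releasing it once its successor has been inspected, rather than collecting a list.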

Also, we are looking to replace SynonymFilter with a new
SynonymGraphFilter (https://issues.apache.org/jira/browse/LUCENE-6664)
to fix the notorious yet complex multi-token synonym bug in Lucene
(see http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html).
I wonder if SynonymGraphFilter makes the processing any easier/harder
for you.

http://blog.mikemccandless.com

On Fri, Nov 18, 2016 at 2:58 AM, Bernd Fehling
<bernd.fehling@uni-bielefeld.de> wrote:
> Hi Mike,
>
> let me explain.
>
> First, after looking deeper inside I noticed that the filters are used
> like a stack and called backwards. So the first incrementToken call goes
> to the last filter in the chain. That one also uses incrementToken and
> calls its predecessor in the chain, and so on.
> So everything following SynonymFilter in the chain only gets its
> "knowledge" from the token and its attributes. As a result, a
> "hasSynonyms" function in SynonymFilter makes no sense. The only
> solution would be another token attribute, and my first assumption was wrong.
>
> Second, what has changed between 4.10.4 and 6.3.0?
> In 4.10.4 SynonymFilter "produced" SYNONYMS which also contained the original
> Token and the first synonym in line had positionIncrement set.
> synonym.txt: bar, foo, foo\ bar, baz
> IN: foo(shingle)posInc=1
> OUT: foo(shingle)posInc=1, foo(SYNONYM)posInc=1, "foo bar"(SYNONYM)posInc=0, baz(SYNONYM)posInc=0
>
> In 6.3.0 the output is different.
> IN: foo(shingle)posInc=1
> OUT: foo(shingle)posInc=1, "foo bar"(SYNONYM)posInc=0, baz(SYNONYM)posInc=0
>
> In 4.10.4 we just dropped the shingles and everything was fine.
> The positionIncrement was correct and the ingoing shingle which generated the SYNONYMs
> was also included as SYNONYM, because it can also be named a SYNONYM as it is equal
> to all other synonyms in synonym.txt.
>
> Now in 6.3.0 this is quite difficult and not as easy as it was.
> - I can't drop all shingles.
> - Because of this kind of stack calling of the filters I can't predict whether a
>   shingle produced SYNONYMs.
>
> Either I have a token attribute which tells me that the shingle coming out of
> SynonymFilter has produced SYNONYMS (and I should not drop it because it is
> not in SYNONYM result anymore),
> Or I have to use caching, wait until incrementToken returns false and then
> parse through all results and clean up.
>
> Because of this backwards calling (stack building) of filters I would
> suggest another token attribute which tells me if something going into
> SynonymFilter has produced SYNONYMs, which will follow next.
>
> What do you think, any other idea?
> As I mentioned, this is for a special solution and probably not very common.
>
> Regards
> Bernd
>
>
> Am 17.11.2016 um 22:17 schrieb Michael McCandless:
>> Hmm are you saying SynonymFilter in 4.10.4 has this capability but
>> 6.3.0 lost it?
>>
>> So if you have a synonym "wow that's funny" -> "wtf", you want the
>> token for "wow" to state that it has a synonym?
>>
>> Using the PositionLengthAttribute you should be able to reconstruct
>> this, because when you see "wtf" with position length 3, you know it
>> spanned "wow", "that's", "funny".
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Nov 17, 2016 at 10:22 AM, Bernd Fehling
>> <bernd.fehling@uni-bielefeld.de> wrote:
>>> Currently I'm tackling a problem with SynonymFilter while going from 4.10.4 to 6.3.0.
>>>
>>> For a special solution I need to know if a word (or multiword) is producing
>>> synonyms in SynonymFilter.
>>>
>>> Therefore I suggest the enhancement of "hasSynonyms" for SynonymFilter.
>>>
>>> A workaround would be to buffer all results from SynonymFilter and check whether
>>> the token following a word or multiword (of any type) is a SYNONYM.
>>>
>>> A function "hasSynonyms" in SynonymFilter would make things easy :-)
>>>
>>> What do you think about this?
>>>
>>> Regards
>>> Bernd
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>