lucene-java-user mailing list archives

From Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
Subject Re: enhancement for SynonymFilter
Date Fri, 18 Nov 2016 07:58:29 GMT
Hi Mike,

let me explain.

First, after looking deeper into the code I noticed that the filters are used
like a stack and called backwards. The first incrementToken call goes
to the last filter in the chain. That filter in turn calls incrementToken
on its predecessor in the chain, and so on.
So everything that follows SynonymFilter in the chain gets its
"knowledge" only from the token and its attributes. As a result, a
"hasSynonyms" function in SynonymFilter makes no sense. The only
solution would be another token attribute, so my first assumption was wrong.
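
To illustrate the calling order described above, here is a minimal plain-Java
sketch. It deliberately does not use the real Lucene classes: SketchStream,
SketchSource, SketchFilter and PullChainDemo are hypothetical stand-ins, and
the term is copied from stage to stage, whereas real Lucene filters share
attributes. It only shows that the last filter is entered first and pulls from
its predecessor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for a token stream; NOT the real Lucene API.
abstract class SketchStream {
    // Shared call log so we can see the order in which stages are entered.
    static final List<String> CALLS = new ArrayList<>();
    String term; // stands in for the term attribute of the current token
    abstract boolean incrementToken();
}

// Source of tokens, standing in for a tokenizer.
class SketchSource extends SketchStream {
    private final Iterator<String> it;
    SketchSource(String... terms) { it = Arrays.asList(terms).iterator(); }
    @Override boolean incrementToken() {
        CALLS.add("source");
        if (!it.hasNext()) return false;
        term = it.next();
        return true;
    }
}

// Pass-through filter that records when it is entered, then pulls from its input.
class SketchFilter extends SketchStream {
    private final String name;
    private final SketchStream input;
    SketchFilter(String name, SketchStream input) {
        this.name = name;
        this.input = input;
    }
    @Override boolean incrementToken() {
        CALLS.add(name);
        if (!input.incrementToken()) return false;
        term = input.term; // all a later stage ever sees is the token itself
        return true;
    }
}

class PullChainDemo {
    // Builds source -> synonym -> dropShingles and pulls one token from the end.
    static List<String> callOrderForFirstToken() {
        SketchStream.CALLS.clear();
        SketchStream chain = new SketchFilter("dropShingles",
                new SketchFilter("synonym", new SketchSource("foo")));
        chain.incrementToken();
        return new ArrayList<>(SketchStream.CALLS);
    }

    public static void main(String[] args) {
        System.out.println(callOrderForFirstToken()); // [dropShingles, synonym, source]
    }
}
```

The stage after the synonym stage is entered first and has no handle on what
the synonym stage did internally, which is why only the token's attributes can
carry that information forward.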

Second, what has changed between 4.10.4 and 6.3.0?
In 4.10.4 SynonymFilter "produced" SYNONYMs which also contained the original
token, and the first synonym in line had positionIncrement set.
synonym.txt: bar, foo, foo\ bar, baz
IN: foo(shingle)posInc=1
OUT: foo(shingle)posInc=1, foo(SYNONYM)posInc=1, "foo bar"(SYNONYM)posInc=0, baz(SYNONYM)posInc=0

In 6.3.0 the output is different.
IN: foo(shingle)posInc=1
OUT: foo(shingle)posInc=1, "foo bar"(SYNONYM)posInc=0, baz(SYNONYM)posInc=0

In 4.10.4 we just dropped the shingles and everything was fine.
The positionIncrement was correct, and the incoming shingle which generated the
SYNONYMs was itself included as a SYNONYM. That is reasonable, because it is
equivalent to all the other synonyms in its synonym.txt entry.

In 6.3.0 this is no longer as easy as it was.
- I can't drop all shingles.
- Because of this kind of stack calling of the filters I can't predict whether a
  shingle has produced SYNONYMs.

Either I have a token attribute which tells me that the shingle coming out of
SynonymFilter has produced SYNONYMs (and that I should not drop it, because it
is no longer part of the SYNONYM result),
or I have to use caching, wait until incrementToken returns false, and then
parse through all results and clean up.
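
The caching variant could look roughly like the following plain-Java sketch.
It is only a sketch under assumptions: Tok and SynonymLookahead are
hypothetical names, not Lucene classes, and it assumes the 6.3.0 output shown
above, where the SYNONYMs follow their shingle directly with posInc=0. The
extra "qux" shingle is made up for the demo, and adjusting the position
increments of later tokens after a drop is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java token: term text, token type and position increment.
final class Tok {
    final String term;
    final String type;
    final int posInc;
    Tok(String term, String type, int posInc) {
        this.term = term;
        this.type = type;
        this.posInc = posInc;
    }
}

class SynonymLookahead {
    // Walks the fully buffered output and keeps a shingle only if the next
    // token is a SYNONYM stacked on the same position (posInc == 0).
    static List<String> dropShinglesWithoutSynonyms(List<Tok> buffered) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < buffered.size(); i++) {
            Tok t = buffered.get(i);
            if (!"shingle".equals(t.type)) {
                kept.add(t.term); // SYNONYMs and other token types pass through
                continue;
            }
            boolean hasSynonyms = i + 1 < buffered.size()
                    && "SYNONYM".equals(buffered.get(i + 1).type)
                    && buffered.get(i + 1).posInc == 0;
            if (hasSynonyms) {
                kept.add(t.term);
            }
        }
        return kept;
    }

    // The 6.3.0 output from above, plus one made-up shingle without synonyms.
    static List<String> demo() {
        List<Tok> out = Arrays.asList(
                new Tok("foo", "shingle", 1),
                new Tok("foo bar", "SYNONYM", 0),
                new Tok("baz", "SYNONYM", 0),
                new Tok("qux", "shingle", 1));
        return dropShinglesWithoutSynonyms(out);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [foo, foo bar, baz]
    }
}
```

The drawback is visible in the signature: the whole output has to be buffered
before any decision can be made, which a token attribute would avoid.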

Because of this backwards calling (stack building) of filters I would
suggest another token attribute which tells me if something going into
SynonymFilter has produced SYNONYMs, which will follow next.

What do you think? Any other ideas?
As I mentioned, this is for a special solution and probably not very common.

Regards
Bernd


Am 17.11.2016 um 22:17 schrieb Michael McCandless:
> Hmm are you saying SynonymFilter in 4.10.4 has this capability but
> 6.3.0 lost it?
> 
> So if you have a synonym "wow that's funny" -> "wtf", you want the
> token for "wow" to state that it has a synonym?
> 
> Using the PositionLengthAttribute you should be able to reconstruct
> this, because when you see "wtf" with position length 3, you know it
> spanned "wow", "that's", "funny".
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Thu, Nov 17, 2016 at 10:22 AM, Bernd Fehling
> <bernd.fehling@uni-bielefeld.de> wrote:
>> Currently I'm tackling a problem with SynonymFilter while going from 4.10.4 to 6.3.0.
>>
>> For a special solution I need to know if a word (or multiword) is producing
>> synonyms in SynonymFilter.
>>
>> Therefore I suggest the enhancement of "hasSynonyms" for SynonymFilter.
>>
>> A workaround would be to buffer all results from SynonymFilter and check
>> whether the token following a word or multiword (of any type) is a SYNONYM.
>>
>> A function "hasSynonyms" in SynonymFilter would make things easy :-)
>>
>> What do you think about this?
>>
>> Regards
>> Bernd
>>
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
