lucene-java-user mailing list archives

From Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
Subject Re: enhancement for SynonymFilter
Date Fri, 18 Nov 2016 08:28:36 GMT


On 18.11.2016 at 08:58, Bernd Fehling wrote:
> Hi Mike,
> 
> let me explain.
> 
> First, after looking deeper I noticed that the filters are used
> like a stack and called backwards. The first incrementToken call goes
> to the last filter in the chain. That filter in turn calls incrementToken
> on its predecessor in the chain, and so on.
> So everything that follows SynonymFilter in the chain gets its
> "knowledge" only from the token and its attributes. As a result, a
> "hasSynonyms" function in SynonymFilter makes no sense. The only
> solution would be another token attribute, and my first assumption was wrong.
> 
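
The backwards calling described above can be illustrated with a small sketch. It is plain Java and deliberately does not use the Lucene classes; `Stream`, `source` and `filter` are hypothetical stand-ins for TokenStream, a tokenizer and a TokenFilter, and the filter names are only labels. It shows only the call order: the consumer asks the last filter, which asks its predecessor, and so on down to the source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PullChainSketch {
    /** Hypothetical stand-in for a TokenStream: returns the next term, or null when exhausted. */
    interface Stream {
        String next(List<String> callLog);
    }

    /** Stand-in for a tokenizer that simply hands out the given terms. */
    static Stream source(String... terms) {
        Iterator<String> it = Arrays.asList(terms).iterator();
        return log -> {
            log.add("source");
            return it.hasNext() ? it.next() : null;
        };
    }

    /** Stand-in for a filter: records that it was called, then pulls from its input. */
    static Stream filter(String name, Stream input) {
        return log -> {
            log.add(name);
            return input.next(log);
        };
    }

    /** Builds source -> shingle -> synonym -> dropShingles and pulls one token. */
    static List<String> callOrderForFirstToken() {
        Stream chain = filter("dropShingles",
                filter("synonym",
                        filter("shingle", source("foo"))));
        List<String> log = new ArrayList<>();
        chain.next(log); // the consumer only ever talks to the last filter
        return log;
    }

    public static void main(String[] args) {
        // prints [dropShingles, synonym, shingle, source]
        System.out.println(callOrderForFirstToken());
    }
}
```

In this model the last filter sees nothing of what happened inside the synonym step except the value handed back to it, which is why extra information would have to travel with the token itself.
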
> Second, what has changed between 4.10.4 and 6.3.0?
> In 4.10.4 SynonymFilter "produced" SYNONYMS which also contained the original
> Token and the first synonym in line had positionIncrement set.
> synonym.txt: bar, foo, foo\ bar, baz
That was a typo; the correct line is:
synonym.txt: foo, foo\ bar, baz
> IN: foo(shingle)posInc=1
> OUT: foo(shingle)posInc=1, foo(SYNONYM)posInc=1, "foo bar"(SYNONYM)posInc=0, baz(SYNONYM)posInc=0
> 
> In 6.3.0 the output is different.
> IN: foo(shingle)posInc=1
> OUT: foo(shingle)posInc=1, "foo bar"(SYNONYM)posInc=0, baz(SYNONYM)posInc=0
> 
> In 4.10.4 we just dropped the shingles and everything was fine.
> The positionIncrement was correct and the incoming shingle which generated the SYNONYMs
> was also included as a SYNONYM, because it can also be called a SYNONYM: it is equivalent
> to all the other synonyms in synonym.txt.
> 
> Now in 6.3.0 this is much harder than it was.
> - I can't drop all shingles.
> - Because of this kind of stack-like calling of the filters I can't predict whether a
>   shingle produced SYNONYMs.
> 
> Either I have a token attribute which tells me that the shingle coming out of
> SynonymFilter has produced SYNONYMs (and I should not drop it because it is
> no longer in the SYNONYM result),
> or I have to use caching, wait until incrementToken returns false and then
> go through all results and clean up.
> 
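
The caching alternative could look roughly like the sketch below. It is plain Java without Lucene classes; `Tok` and `cleanUp` are hypothetical names, and the type strings "shingle" and "SYNONYM" are taken from the examples above. The assumption is that a shingle "produced SYNONYMs" exactly when the next buffered token has type SYNONYM and posInc=0. The sketch does not repair the position increments of the tokens that remain after a shingle is dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BufferedShingleCleanup {
    /** Hypothetical stand-in for a buffered token: term text, type, position increment. */
    static final class Tok {
        final String term;
        final String type;
        final int posInc;

        Tok(String term, String type, int posInc) {
            this.term = term;
            this.type = type;
            this.posInc = posInc;
        }
    }

    /** Keeps every non-shingle token, and keeps a shingle only if SYNONYMs follow it. */
    static List<Tok> cleanUp(List<Tok> buffered) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < buffered.size(); i++) {
            Tok t = buffered.get(i);
            if (!"shingle".equals(t.type)) {
                out.add(t);
                continue;
            }
            // Assumption: a stacked SYNONYM (posInc=0) right after the shingle
            // means the shingle matched an entry in synonym.txt.
            boolean producedSynonyms = i + 1 < buffered.size()
                    && "SYNONYM".equals(buffered.get(i + 1).type)
                    && buffered.get(i + 1).posInc == 0;
            if (producedSynonyms) {
                out.add(t);
            }
        }
        return out;
    }

    /** Runs the 6.3.0 output from above plus one shingle without synonyms. */
    static List<String> demo() {
        List<Tok> buffered = Arrays.asList(
                new Tok("foo", "shingle", 1),
                new Tok("foo bar", "SYNONYM", 0),
                new Tok("baz", "SYNONYM", 0),
                new Tok("qux", "shingle", 1));
        List<String> terms = new ArrayList<>();
        for (Tok t : cleanUp(buffered)) {
            terms.add(t.term);
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [foo, foo bar, baz]
        System.out.println(demo());
    }
}
```

The price of this approach is that nothing can be emitted until the whole input has been consumed, which is the reason a per-token attribute looks more attractive.
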
> Because of this backwards calling (stack building) of filters I would
> suggest another token attribute which tells me if something going into
> SynonymFilter has produced SYNONYMs, which will follow next.
> 
> What do you think, any other idea?
> As I mentioned, this is for a special solution and probably not very common.
> 
> Regards
> Bernd
> 
> 
> On 17.11.2016 at 22:17, Michael McCandless wrote:
>> Hmm are you saying SynonymFilter in 4.10.4 has this capability but
>> 6.3.0 lost it?
>>
>> So if you have a synonym "wow that's funny" -> "wtf", you want the
>> token for "wow" to state that it has a synonym?
>>
>> Using the PositionLengthAttribute you should be able to reconstruct
>> this, because when you see "wtf" with position length 3, you know it
>> spanned "wow", "that's", "funny".
>>
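
A rough sketch of that reconstruction, in plain Java without Lucene classes: `Tok` and `spannedTerms` are hypothetical names, the type "word" for the original tokens is an assumption, and the token values are hand-written from the "wtf" example rather than taken from real SynonymFilter output. Absolute positions are rebuilt by summing position increments; a synonym at position p with position length n then covers positions p to p+n-1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionLengthSpanSketch {
    /** Hypothetical stand-in for a token with position increment and position length. */
    static final class Tok {
        final String term;
        final String type;
        final int posInc;
        final int posLen;

        Tok(String term, String type, int posInc, int posLen) {
            this.term = term;
            this.type = type;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Returns the non-SYNONYM terms whose positions are covered by the synonym at synIdx. */
    static List<String> spannedTerms(List<Tok> tokens, int synIdx) {
        // Rebuild absolute positions from the position increments.
        int[] pos = new int[tokens.size()];
        int p = -1;
        for (int i = 0; i < tokens.size(); i++) {
            p += tokens.get(i).posInc;
            pos[i] = p;
        }
        int start = pos[synIdx];
        int end = start + tokens.get(synIdx).posLen; // exclusive
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (!"SYNONYM".equals(t.type) && pos[i] >= start && pos[i] < end) {
                out.add(t.term);
            }
        }
        return out;
    }

    /** The "wtf" example: the synonym sits at the position of "wow" with length 3. */
    static List<String> demo() {
        List<Tok> tokens = Arrays.asList(
                new Tok("wow", "word", 1, 1),
                new Tok("wtf", "SYNONYM", 0, 3),
                new Tok("that's", "word", 1, 1),
                new Tok("funny", "word", 1, 1));
        return spannedTerms(tokens, 1);
    }

    public static void main(String[] args) {
        // prints [wow, that's, funny]
        System.out.println(demo());
    }
}
```

Like the buffering workaround, this needs look-ahead: the tokens covered by the synonym are only known once the tokens after it have been read.
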
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Nov 17, 2016 at 10:22 AM, Bernd Fehling
>> <bernd.fehling@uni-bielefeld.de> wrote:
>>> Currently I'm tackling a problem with SynonymFilter while going from 4.10.4 to 6.3.0.
>>>
>>> For a special solution I need to know if a word (or multiword) is producing
>>> synonyms in SynonymFilter.
>>>
>>> Therefore I suggest the enhancement of "hasSynonyms" for SynonymFilter.
>>>
>>> A workaround would be to buffer all results from SynonymFilter and check whether
>>> the token following a word or multiword (of any type) is a SYNONYM.
>>>
>>> A function "hasSynonyms" in SynonymFilter would make things easy :-)
>>>
>>> What do you think about this?
>>>
>>> Regards
>>> Bernd
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
> 

-- 
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
