lucene-java-user mailing list archives

From baris.ka...@oracle.com
Subject Re: Best fuzzy match on multiple terms
Date Fri, 14 Jun 2019 14:41:40 GMT
These are great suggestions; I was going to suggest looking at the
explain output of the query, too.

I really wonder why, in your case, the 'Rozi' entry does not get a higher score.

Is there any effect from the ' characters?


In my case I have a sort of reverse situation:

my query is maink~2 (mains was a special case that I am still investigating).
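
For reference, the ~2 suffix asks for indexed terms within two edits of maink. Here is a minimal sketch of plain Levenshtein distance in Java to make that concrete. The class and method names are made up for illustration and are not Lucene API. As far as I know, Lucene's FuzzyQuery also counts a transposition of two adjacent characters as a single edit by default, which this sketch does not do.

```java
/** Hypothetical helper for illustration only, not part of Lucene. */
final class EditDistanceSketch {

    /** Plain Levenshtein distance: insertions, deletions and substitutions each cost 1. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete all i characters of a's prefix
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"main", "mains", "maine"}) {
            System.out.println("maink -> " + candidate + ": "
                    + distance("maink", candidate));
        }
    }
}
```

Both main and mains are a single edit away from maink, so either one falls within the ~2 limit.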

I would expect the second result below to come first, as it is
shorter and the closest hit, and the first result to come second.

NASHUA in results: MAIN DUNSTABLE NASHUA HILLSBOROUGH NEW HAMPSHIRE 
UNITED STATES in the 0 th result
NASHUA in results: MAIN NASHUA HILLSBOROUGH NEW HAMPSHIRE UNITED STATES 
in the 1 th result
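
To show why I expect the shorter entry first: with BM25, a single term match in a shorter field normally scores higher, and the b parameter controls how strongly. Below is a rough sketch of the textbook Okapi BM25 term score in plain Java. It is not Lucene's actual implementation, and the class name, the document counts and the average field length are assumptions made up for the example. The field lengths 7 and 8 are simply the token counts of the two results above, assuming each is indexed as one field.

```java
/** Hypothetical helper for illustration only: textbook Okapi BM25 for one term. */
final class Bm25LengthNormSketch {

    /** IDF in the form log(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Score contribution of one term.
     * b = 0 switches length normalization off; b = 1 applies it fully.
     */
    static double termScore(double tf, double idf, double fieldLen,
                            double avgFieldLen, double k1, double b) {
        double lengthNorm = 1.0 - b + b * (fieldLen / avgFieldLen);
        return idf * (tf * (k1 + 1.0)) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double termIdf = idf(1000, 10); // assumed: 1000 docs, 10 contain the term
        double avgLen = 7.5;            // assumed average field length in tokens
        double k1 = 1.2;                // default k1 mentioned in this thread

        for (double b : new double[] {0.75, 0.25, 0.0}) {
            double shorter = termScore(1, termIdf, 7, avgLen, k1, b);
            double longer = termScore(1, termIdf, 8, avgLen, k1, b);
            System.out.printf("b=%.2f  7 tokens: %.4f  8 tokens: %.4f%n",
                    b, shorter, longer);
        }
    }
}
```

With any b above 0 the 7-token entry gets the higher term score, and with b = 0 the two are equal, so in this simplified picture field length alone would not explain the longer entry ranking first.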


Best regards


On 6/14/19 6:45 AM, Matthias Müller wrote:
> Hi Namgyu and Tomoko,
>
> your hint towards Explanation was very helpful and I was not aware of
> this feature.
>
> I have now experimented with different scoring functions and it seems
> that DFISimilarity and BM25Similarity (with lower 'b') produce results
> in the direction I prefer, though not perfect for some cases [1].
>
> The fuzzy term queries probably generate hardly predictable
> similarities on additional fields. These add scores to the overall
> result and also affect normalization.
>
> Positively, the preferred matches are somewhere in the top ranks. So
> maybe rule-based assessment of the top N hits might help me achieve
> what I want.
>
>
> - Matthias
>
>
> [1]:
> "Abelia xgrandiflora" -> "Abelia xgrandiflora 'Wevo1' BELLA DONNA"
> (score=13.7869625)
> instead of the direct match
> "Abelia xgrandiflora" -> "Abelia xgrandiflora" (score=13.74585)
>
> On Friday, 14.06.2019 at 16:32 +0900, Tomoko Uchida wrote:
>> Hi Matthias,
>>
>> What similarity class are you using?
>> Just a guess... but possibly one reason is document (field) length
>> normalization. Generally speaking, shorter documents get higher
>> scores than longer documents. (I have seen that the classic TF-IDF
>> similarity tends to give much higher scores to shorter documents.
>> Newer versions of Lucene use BM25 similarity as the default, which
>> moderates this tendency and has a tuning parameter 'b' to control
>> the normalization effect.)
>> See also:
>> https://www.elastic.co/guide/en/elasticsearch/guide/current/pluggable-similarites.html
>>
>> As Namgyu Kim said, explain() API could help you to examine the
>> details.
>>
>> Tomoko
>>
>> On Fri, Jun 14, 2019 at 1:27, Namgyu Kim <kng0828@gmail.com> wrote:
>>> Dear Matthias,
>>>
>>> First you need to know about Lucene's ranking concept.
>>> Lucene's default ranking is BM25, and it depends on the state of
>>> your index.
>>> (https://en.wikipedia.org/wiki/Okapi_BM25)
>>> There can be many reasons.
>>> One thing I can guess is that the term 'rozi' occurs many times in
>>> your index, so it carries little weight.
>>> This is called IDF (Inverse Document Frequency).
>>> Anyway, if you want fine-grained control over the scoring, you need
>>> to understand the BM25 formula.
>>>
>>> And Lucene can tell you how your score came out.
>>> Explanation can be used to get it.
>>> I attach the sample code.
>>> ======================================
>>> IndexSearcher searcher = new IndexSearcher(reader);
>>> TopDocs docs = searcher.search(q, hitsPerPage);
>>> ScoreDoc[] hits = docs.scoreDocs;
>>>
>>> for (int i = 0; i < hits.length; ++i) {
>>>    int docId = hits[i].doc;
>>>    Explanation explanation = searcher.explain(q, docId);
>>>    // You can see how the score is calculated
>>>    System.out.println("Explanation : " + explanation.toString());
>>> }
>>> ======================================
>>>
>>> I hope it helps :D
>>>
>>> Best regards,
>>> Namgyu Kim
>>>
>>> P.S. For BM25, the default value in Lucene is k1 = 1.2, b = 0.75.
>>>
>>> On Fri, Jun 14, 2019 at 12:54 AM, <baris.kazar@oracle.com> wrote:
>>>
>>>> I would suggest trying (indexing and searching) without the '
>>>> characters to see whether you can find it first.
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On 6/13/19 11:25 AM, Matthias Müller wrote:
>>>>> I am currently matching botanic names (with possible misspellings)
>>>>> against an indexed reference list with Lucene. After quick progress
>>>>> in the beginning, I am struggling with the proper query design to
>>>>> achieve the ranking result I want.
>>>>>
>>>>> Here is an example:
>>>>>
>>>>> Search term: Acer campestre 'Rozi'
>>>>>
>>>>> Tokenized (decomposed) representation:
>>>>> acer
>>>>> campestre
>>>>> rozi
>>>>>
>>>>> Top 10 hits:
>>>>> {value=Acer campestre, score=12.288989}
>>>>> {value=Acer campestre 'Rozi', score=11.955223} // <- why is it 2nd?
>>>>> {value=Acer campestre 'Arends', score=10.640412}
>>>>> {value=Acer campestre subsp. leiocarpon, score=10.640412}
>>>>> {value=Acer campestre 'Carnival', score=10.640412}
>>>>> {value=Acer campestre 'Commodore', score=10.640412}
>>>>> {value=Acer campestre 'Nanum', score=10.640412}
>>>>> {value=Acer campestre 'Elsrijk', score=10.640412}
>>>>> {value=Acer campestre 'Fastigiatum', score=10.640412}
>>>>> {value=Acer campestre 'Geessink', score=10.640412}]
>>>>>
>>>>>
>>>>> And here is how I create my queries:
>>>>>
>>>>>     final BooleanQuery.Builder builder = new BooleanQuery.Builder();
>>>>>     // add individual tokens to query
>>>>>     for (String token : fuzzyTokens) {
>>>>>       final Term term = new Term(NAME_TOKENS.name(), token);
>>>>>       // FuzzyQuery(Term) uses the default maximum edit distance of 2
>>>>>       final FuzzyQuery fq = new FuzzyQuery(term);
>>>>>       // SHOULD: every token is optional, each match adds to the score
>>>>>       builder.add(fq, BooleanClause.Occur.SHOULD);
>>>>>     }
>>>>>     return builder.build();
>>>>> }
>>>>>
>>>>>
>>>>> Input names are analyzed with a StandardTokenizer and Lowercase
>>>>> filter
>>>>> when they are added to the IndexWriter.
>>>>>
>>>>>
>>>>> My question: How can I get a ranking that scores
>>>>> "Acer campestre 'Rozi'" higher than "Acer campestre"?
>>>>> I am sure there is an obvious way to achieve this that I have
>>>>> yet
>>>>> failed to find.
>>>>>
>>>>>
>>>>> -Matthias
>>>>>
>>>>>
>>>>> -------------------------------------------------------------
>>>>> --------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail:
>>>>> java-user-help@lucene.apache.org
>>>>>
>



