lucene-java-user mailing list archives

From Namgyu Kim <kng0...@gmail.com>
Subject Re: Best fuzzy match on multiple terms
Date Thu, 13 Jun 2019 16:26:34 GMT
Dear Matthias,

First, you need to understand how Lucene ranks results.
Lucene's default ranking function is BM25, and the scores it produces
depend on the statistics of your index.
(https://en.wikipedia.org/wiki/Okapi_BM25)
There can be many reasons for the ranking you see.
One thing I can guess is that the term 'rozi' occurs in many documents in
your index, so a match on it adds little to the score.
This effect comes from IDF (Inverse Document Frequency).
Anyway, if you want fine-grained control over the ranking, you need to
understand the BM25 formula.
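
To make the IDF point concrete, here is a minimal sketch of the textbook BM25 formula in plain Java. It does not use Lucene itself: the class name, method names, and the index size are made up for illustration, and Lucene's real implementation differs in some details, so treat the numbers only as a way to see the trend (a term found in many documents gets a low IDF).

```java
// Illustrative sketch only: textbook BM25, not Lucene's actual code.
class Bm25Sketch {
    // Default parameter values used by Lucene's BM25Similarity.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** IDF in the form ln(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Contribution of one query term to one document's score. */
    static double score(double freq, long docFreq, long docCount,
                        double docLen, double avgDocLen) {
        // Length normalization: longer-than-average documents are penalized.
        double norm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf(docFreq, docCount) * freq * (K1 + 1.0) / (freq + norm);
    }

    public static void main(String[] args) {
        long docCount = 10_000; // hypothetical index size
        System.out.printf("idf, term in 5 docs:    %.4f%n", idf(5, docCount));
        System.out.printf("idf, term in 5000 docs: %.4f%n", idf(5_000, docCount));
    }
}
```

The rarer term gets the larger IDF, so a match on it counts for more.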

Lucene can also tell you how a score was computed.
You can use Explanation to get this information.
I attach sample code.
======================================
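import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// reader, q and hitsPerPage come from your own code
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs docs = searcher.search(q, hitsPerPage);
ScoreDoc[] hits = docs.scoreDocs;

for (int i = 0; i < hits.length; ++i) {
  int docId = hits[i].doc;
  // Ask Lucene to break down how this document's score was calculated
  Explanation explanation = searcher.explain(q, docId);
  System.out.println("Explanation : " + explanation);
}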
======================================

I hope it helps :D

Best regards,
Namgyu Kim

P.S. For BM25, the default values in Lucene are k1 = 1.2 and b = 0.75.
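
As a rough illustration of what b does, the sketch below computes only the term-frequency part of textbook BM25 for a 2-token and a 3-token document. The class name, method name, and document lengths are assumptions chosen for the example, not taken from your index. With b = 0.75 the shorter document gets the larger value; with b = 0 document length no longer matters.

```java
// Illustrative sketch only: the length-normalized tf factor of textbook BM25.
class LengthNormSketch {

    /** tf factor: freq * (k1 + 1) / (freq + k1 * (1 - b + b * docLen / avgDocLen)). */
    static double tfNorm(double freq, double k1, double b,
                         double docLen, double avgDocLen) {
        double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return freq * (k1 + 1.0) / (freq + norm);
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        double avgDocLen = 3.0; // hypothetical average field length in tokens

        // Try the default b and b = 0 (length normalization switched off).
        for (double b : new double[] {0.75, 0.0}) {
            double shortDoc = tfNorm(1, k1, b, 2, avgDocLen); // 2-token document
            double longDoc = tfNorm(1, k1, b, 3, avgDocLen);  // 3-token document
            System.out.printf("b=%.2f  2 tokens: %.4f  3 tokens: %.4f%n",
                    b, shortDoc, longDoc);
        }
    }
}
```

The Explanation output shows the corresponding length-dependent factors for your real documents, which is the place to check whether this effect plays a role in your ranking.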

On Fri, Jun 14, 2019 at 12:54 AM, <baris.kazar@oracle.com> wrote:

> I would suggest trying (indexing and searching) without the ' characters
> and seeing whether you can find it first.
>
> Thanks
>
>
> On 6/13/19 11:25 AM, Matthias Müller wrote:
> > I am currently matching botanic names (with possible mis-spellings)
> > against an indexed reference list with Lucene. After quick progress in
> > the beginning, I am struggling with the proper query design to achieve
> > a ranking result I want.
> >
> > Here is an example:
> >
> > Search term: Acer campestre 'Rozi'
> >
> > Tokenized (decomposed) representation:
> > acer
> > campestre
> > rozi
> >
> > Top 10 hits:
> > {value=Acer campestre, score=12.288989}
> > {value=Acer campestre 'Rozi', score=11.955223} // <- why is it 2nd?
> > {value=Acer campestre 'Arends', score=10.640412}
> > {value=Acer campestre subsp. leiocarpon, score=10.640412}
> > {value=Acer campestre 'Carnival', score=10.640412}
> > {value=Acer campestre 'Commodore', score=10.640412}
> > {value=Acer campestre 'Nanum', score=10.640412}
> > {value=Acer campestre 'Elsrijk', score=10.640412}
> > {value=Acer campestre 'Fastigiatum', score=10.640412}
> > {value=Acer campestre 'Geessink', score=10.640412}]
> >
> >
> > And here is how I create my queries:
> >
> > final BooleanQuery.Builder builder = new BooleanQuery.Builder();
> >    // add individual tokens to query
> >    for (String token : fuzzyTokens) {
> >      final Term term = new Term(NAME_TOKENS.name(), token);
> >      final FuzzyQuery fq = new FuzzyQuery(term);
> >      builder.add(fq, BooleanClause.Occur.SHOULD);
> >    }
> >    return builder.build();
> > }
> >
> >
> > Input names are analyzed with a StandardTokenizer and Lowercase filter
> > when they are added to the IndexWriter.
> >
> >
> > My question: How can I get a ranking that scores
> > "Acer campestre 'Rozi'" higher than "Acer campestre"?
> > I am sure there is an obvious way to achieve this that I have so far
> > failed to find.
> >
> >
> > -Matthias
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
