Thanks to Adrien for responding. I ran IndexSearcher.explain in Lucene
5.4.1; the results are pasted below for Basti Bosan (the highest-ranked
result) and Boston (the preferred result).

I'm not 100% sure how to interpret this in terms of (a) Lucene's weighting
of the term in the document, and (b) the impact of my Sort preference of
FIELD_SCORE followed by population. However, it appears that Basti Bosan
rises higher in the results with a score that's the sum of 0.6737946
(basti compared to bostn~) + 0.8983928 (bosan compared to bostn~). Without
completely understanding the changes that resulted from LUCENE-329, it
appears that Basti Bosan (as a multi-term result) has an advantage in that
the sum of both term weights is enough to get it over the top.

This result makes sense if I intend to fuzzy match a term against a
document with lots of words to retrieve the most relevant result. However,
in this case I'm actually looking for the nearest match to a single term.
Therefore, documents with multiple terms should be weighted lower on
single-term queries. Is there a way to get the nearest match, rather than
the most relevant document?
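
To make "nearest match" concrete, here is a minimal sketch in plain Java (no
Lucene classes) that computes the Levenshtein distance from the query term to
each of the index terms that appear in the explanations in this message. The
class name NearestTermSketch and the similarity formula
1 - distance / min(length) are assumptions for illustration. The formula
happens to reproduce the 0.8 and 0.6 boost values in the explain output, but
nothing in this thread confirms that it is what FuzzyQuery uses.

```java
// Illustrative sketch only: plain Java, no Lucene API involved.
class NearestTermSketch {

    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions and substitutions each cost 1).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // ASSUMPTION: similarity = 1 - distance / min(length of query, length of term).
    // Chosen because it matches the boosts shown in the explain output.
    static float similarity(String query, String term) {
        int distance = levenshtein(query, term);
        return 1.0f - (float) distance / Math.min(query.length(), term.length());
    }

    public static void main(String[] args) {
        String query = "bostn";
        String[] terms = {"basti", "bosan", "bostan", "boston"};
        for (String term : terms) {
            System.out.printf("%-7s distance=%d similarity=%.1f%n",
                    term, levenshtein(query, term), similarity(query, term));
        }
    }
}
```

Under this measure bosan, bostan and boston are each one edit away from
bostn, so edit distance alone does not separate Boston from Basti Bosan. A
tie-breaker such as the number of terms in the name, or the population, would
still be needed.
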
*Basti Bosan Explanation*
1.5721874 = sum of:
  0.6737946 = weight(indexName:basti in 1465524) [BinarySimilarity], result of:
    0.6737946 = score(doc=1465524,freq=1.0), product of:
      0.12056191 = queryWeight, product of:
        0.6 = boost
        8.942057 = idf(docFreq=6728, maxDocs=18929620)
        0.02247095 = queryNorm
      5.588785 = fieldWeight in 1465524, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        8.942057 = idf(docFreq=6728, maxDocs=18929620)
        0.625 = fieldNorm(doc=1465524)
  0.8983928 = weight(indexName:bosan in 1465524) [BinarySimilarity], result of:
    0.8983928 = score(doc=1465524,freq=1.0), product of:
      0.16074921 = queryWeight, product of:
        0.8 = boost
        8.942057 = idf(docFreq=6728, maxDocs=18929620)
        0.02247095 = queryNorm
      5.588785 = fieldWeight in 1465524, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        8.942057 = idf(docFreq=6728, maxDocs=18929620)
        0.625 = fieldNorm(doc=1465524)
*Boston Explanation*
1.4374286 = sum of:
  1.4374286 = weight(indexName:bostan in 647770) [BinarySimilarity], result of:
    1.4374286 = score(doc=647770,freq=1.0), product of:
      0.16074921 = queryWeight, product of:
        0.8 = boost
        8.942057 = idf(docFreq=6728, maxDocs=18929620)
        0.02247095 = queryNorm
      8.942057 = fieldWeight in 647770, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        8.942057 = idf(docFreq=6728, maxDocs=18929620)
        1.0 = fieldNorm(doc=647770)
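
The arithmetic in the two explanations above can be reproduced with a few
multiplications. The sketch below is plain Java, not Lucene code; the class
and method names (ExplainArithmeticSketch, termScore) are hypothetical, and
every constant is copied from the explain output. It only mirrors the
products that explain prints: score = queryWeight * fieldWeight, with
queryWeight = boost * idf * queryNorm and fieldWeight = tf * idf * fieldNorm.

```java
// Illustrative sketch only: replays the products printed by explain.
class ExplainArithmeticSketch {

    // Constants copied from the explain output (identical for all three terms).
    static final float IDF = 8.942057f;
    static final float QUERY_NORM = 0.02247095f;

    // score = (boost * idf * queryNorm) * (tf * idf * fieldNorm)
    static float termScore(float boost, float tf, float fieldNorm) {
        float queryWeight = boost * IDF * QUERY_NORM;
        float fieldWeight = tf * IDF * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        // "Basti Bosan": two matching terms, fieldNorm 0.625 each.
        float basti = termScore(0.6f, 1.0f, 0.625f);
        float bosan = termScore(0.8f, 1.0f, 0.625f);
        // "Boston": one matching term (bostan), fieldNorm 1.0.
        float bostan = termScore(0.8f, 1.0f, 1.0f);

        System.out.println("basti          = " + basti);
        System.out.println("bosan          = " + bosan);
        System.out.println("Basti Bosan    = " + (basti + bosan) + " (sum)");
        System.out.println("Boston         = " + bostan);
        System.out.println("max(basti, bosan) = " + Math.max(basti, bosan));
    }
}
```

Each Basti Bosan term on its own scores below the single Boston term, because
the fieldNorm of 0.625 discounts the two-term name. Only the sum of the two
matches pushes Basti Bosan ahead. If only the best-matching term counted
(maximum instead of sum), Basti Bosan would score about 0.898 and rank below
Boston at about 1.437.
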
Jeremy M Glesner
Chief Technology Officer
Berico Technologies, LLC.
11130 Sunrise Valley Drive, Suite 300
Reston, VA 20191
703.731.6984 (m)
703.390.9926 x2014 (o)
www.bericotechnologies.com
On Fri, Apr 22, 2016 at 3:53 AM, Adrien Grand <jpountz@gmail.com> wrote:
> FuzzyQuery scoring was changed in Lucene 5.3:
> https://issues.apache.org/jira/browse/LUCENE-329
>
> Maybe look at the result of IndexSearcher.explain to understand why the
> "Boston" doc got a lower score than your "Basti Bosan" doc?
>
> On Thu, Apr 21, 2016 at 3:39 PM, Jeremy Glesner <jeremy@bericotechnologies.com> wrote:
>
> > Hello,
> >
> > I'm witnessing a change in behavior between Lucene 4.9 and 5.4.1 that I
> > don't quite understand.
> > I'd like to track down what's happening under the hood. I'm working to
> > update the dependencies of an open source geospatial resolution tool (
> > https://github.com/BericoTechnologies/CLAVIN), which uses Lucene. I've
> > indexed the geonames.org database using both Lucene 4.9 and 5.4.1. We
> > index on the Population of each city for later sorting on query.
> >
> > When running a fuzzy query "bostn~" with Occur.MUST in 4.9, we get the
> > expected result of Boston, where 6793534 is a boosted population. Here is
> > the scoreDoc.toString():
> >
> > *Boston: doc=19586055 score=NaN shardIndex=1 fields=[2.971942, 6793534]*
> >
> > However, using 5.4.1, the fuzzy match with Occur.MUST returns "Basti Bosan"
> > and "Boston Basin", both of which have a population of zero, before
> > returning Boston.
> >
> > *Basti Bosan: doc=11707183 score=NaN shardIndex=0 fields=[1.5721874, 0]*
> > *Boston Basin: doc=12728320 score=NaN shardIndex=0 fields=[1.5721874, 0]*
> > *Boston: doc=17515475 score=NaN shardIndex=0 fields=[1.4374285, 6793534]*
> >
> > I'm wondering if something with the FIELD_SCORE calculation changed between
> > 4.9 and 5.4.1, or perhaps I've done something incorrect in building the
> > index, etc.
> >
> > It's worth mentioning that for this test I have built an index with both
> > 4.9 and 5.4.1 using the same geonames database to ensure consistency.
> > Also, the sort is set up the same way in both versions:
> >
> > *private static final Sort POPULATION_SORT = new Sort(new SortField[] {*
> > *    SortField.FIELD_SCORE,*
> > *    new SortedNumericSortField(SORT_POP.key(), SortField.Type.LONG, true)*
> > *});*
> >
> > With regard to building the index, in 4.9, we added the population sort
> > field to the index like so:
> >
> > *doc.add(new LongField(SORT_POP.key(), geoName.getPopulation(),
> > Field.Store.YES));*
> >
> > Because you can't sort on docValue = NONE anymore, in 5.4.1, we now add it
> > like this:
> >
> > *doc.add(new LongField(SORT_POP.key(), geoName.getPopulation(),
> > LONG_FIELD_TYPE_STORED_SORTED));*
> >
> > where LONG_FIELD_TYPE_STORED_SORTED is:
> >
> > *private static final FieldType LONG_FIELD_TYPE_STORED_SORTED = new FieldType();*
> >
> > *static {*
> > *    LONG_FIELD_TYPE_STORED_SORTED.setTokenized(false);*
> > *    LONG_FIELD_TYPE_STORED_SORTED.setOmitNorms(true);*
> > *    LONG_FIELD_TYPE_STORED_SORTED.setIndexOptions(IndexOptions.DOCS);*
> > *    LONG_FIELD_TYPE_STORED_SORTED.setNumericType(FieldType.NumericType.LONG);*
> > *    LONG_FIELD_TYPE_STORED_SORTED.setStored(true);*
> > *    LONG_FIELD_TYPE_STORED_SORTED.setDocValuesType(DocValuesType.NUMERIC);*
> > *    LONG_FIELD_TYPE_STORED_SORTED.freeze();*
> > *}*
> >
> > I would greatly appreciate any insights here; and I'm happy to answer
> > questions to unravel this a bit more. Thank you for your time!
> >
> > V/r,
> > Jeremy
> >
>
