lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: "Similarity" of numbers in MoreLikeThisHandler
Date Fri, 04 Jul 2008 21:23:46 GMT

: I didn't realize that subsets were used to evaluate similarity. From your
: example, I assume that the strings: 456 and 123456 are "similar". If I store
: them as integers instead of strings, will Solr/Lucene still use subsets to
: assign similarity?

Strictly speaking, MLT operates on "Terms" ... tf/idf come into play, 
but roughly speaking, the more terms two docs have in common, the more 
similar MLT considers them.  If you use an Analyzer that produces Terms 
based on substrings (like an ngram tokenizer, for example) then MLT will 
consider docs similar if they have substrings in common ... if you don't, 
then it won't.
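
To make the point about Terms concrete, here is a rough sketch in plain 
Python (not Lucene code).  The helper names whole_terms, ngram_terms and 
shared_terms are made up for illustration, and counting shared terms is 
only a stand-in for what MLT does, since it ignores tf/idf weighting 
entirely.

```python
def whole_terms(value):
    """Treat the entire value as one term, i.e. no substring analysis."""
    return {value}


def ngram_terms(value, min_n=2, max_n=3):
    """Emit every substring of length min_n..max_n, roughly what an
    n-gram tokenizer would produce (sizes here are arbitrary assumptions)."""
    return {
        value[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(value) - n + 1)
    }


def shared_terms(a, b, analyze):
    """Terms that both values have in common under the given analysis."""
    return analyze(a) & analyze(b)


# With one term per value, "456" and "123456" have nothing in common.
print(sorted(shared_terms("456", "123456", whole_terms)))  # []

# With n-gram terms, they overlap on "45", "56" and "456".
print(sorted(shared_terms("456", "123456", ngram_terms)))  # ['45', '456', '56']
```

So whether "456" and "123456" look similar depends entirely on how the 
field is analyzed into terms.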

A version of MLT that knows about numeric fields and tries to 
find "numeric similarity" is possible ... but it would be hard to 
generalize ... you might consider 4 similar to 5, but other people 
wouldn't.  Even with stats like the min/mean/max/stddev for a field, it 
would be hard to really gauge what the various thresholds should be (for 
the same reasons it's hard to try and write completely generic numeric 
range faceting code) ... it's a lot easier to code specific logic for your 
specific use case.
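
As a hedged illustration of why the thresholds are hard to generalize, 
here are two equally plausible definitions of "numerically similar" in 
plain Python.  Both functions and their parameters (max_diff, max_ratio) 
are hypothetical, nothing like them exists in Solr's MLT, and they 
disagree about the same pairs of numbers.

```python
def similar_abs(a, b, max_diff):
    """Similar if the absolute difference is within max_diff."""
    return abs(a - b) <= max_diff


def similar_rel(a, b, max_ratio):
    """Similar if the difference is small relative to the larger magnitude."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return True  # both values are zero
    return abs(a - b) / largest <= max_ratio


# One use case: values within 1 of each other count as similar.
print(similar_abs(4, 5, max_diff=1))            # True

# Another use case: values must be within 10% of each other.
print(similar_rel(4, 5, max_ratio=0.1))         # False
print(similar_rel(1000, 1001, max_ratio=0.1))   # True
```

Which rule (and which threshold) is "right" depends on what the numbers 
mean in your data, which is why use-case-specific logic is easier.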

-Hoss

