lucene-dev mailing list archives

From "Tommaso Teofili (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6954) More Like This Query: keep fields separated
Date Sat, 26 Mar 2016 14:51:25 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213066#comment-15213066
] 

Tommaso Teofili commented on LUCENE-6954:
-----------------------------------------

A couple of comments on your patch:
1. in {{MLT#createQueue}} you iterate over the key set of the _field2termFreqMap_ map (and
later do the same for the _word2termFrequency_ map), then look up the value for each key; it's [usually
preferred|http://stackoverflow.com/questions/3870064/performance-considerations-for-keyset-and-entryset-of-map]
to iterate over the entry set instead.
2. the names _field2TermFreqMap_ and _word2termFrequency_ don't sound too nice to me; maybe
_perFieldTermFrequencies_ and _perWordTermFrequencies_ would read slightly better.
3. in the test the static fields are public; you can safely (and preferably) make them private.
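
To illustrate point 1, here is a minimal sketch (not the code from the patch) that contrasts the two iteration styles on a nested field -> word -> frequency map. The class name, the method names and the use of plain {{Integer}} values are assumptions made only for this example.

```java
import java.util.HashMap;
import java.util.Map;

public class EntrySetIteration {

    // Key-set style: each iteration performs an extra get() lookup on the map.
    static int totalViaKeySet(Map<String, Map<String, Integer>> perFieldTermFrequencies) {
        int total = 0;
        for (String field : perFieldTermFrequencies.keySet()) {
            Map<String, Integer> perWordTermFrequencies = perFieldTermFrequencies.get(field);
            for (String word : perWordTermFrequencies.keySet()) {
                total += perWordTermFrequencies.get(word);
            }
        }
        return total;
    }

    // Entry-set style: key and value come from the same entry, so no second lookup.
    static int totalViaEntrySet(Map<String, Map<String, Integer>> perFieldTermFrequencies) {
        int total = 0;
        for (Map.Entry<String, Map<String, Integer>> fieldEntry : perFieldTermFrequencies.entrySet()) {
            for (Map.Entry<String, Integer> wordEntry : fieldEntry.getValue().entrySet()) {
                total += wordEntry.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> titleTerms = new HashMap<>();
        titleTerms.put("lucene", 2);
        titleTerms.put("query", 1);
        Map<String, Map<String, Integer>> perField = new HashMap<>();
        perField.put("title", titleTerms);
        // Both styles compute the same total; only the lookup cost differs.
        System.out.println(totalViaKeySet(perField) == totalViaEntrySet(perField));
    }
}
```

Both methods return the same result; the entry-set version simply avoids one map lookup per element.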

Other than that the patch looks good to me.

> More Like This Query: keep fields separated
> -------------------------------------------
>
>                 Key: LUCENE-6954
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6954
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/other
>    Affects Versions: 5.4
>            Reporter: Alessandro Benedetti
>            Assignee: Tommaso Teofili
>              Labels: morelikethis
>         Attachments: LUCENE-6954.patch
>
>
> Currently the query is generated as follows:
> org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
> 1) we extract the terms from the interesting fields, adding them to a map:
> Map<String, Int> termFreqMap = new HashMap<>();
> (we lose the field -> term relation; we no longer know which field the term came from!)
> org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
> 2) we build the queue that will contain the query terms; at this point we connect these terms
> to some field again, but:
> ...
> // go through all the fields and find the largest document frequency
> String topField = fieldNames[0];
> int docFreq = 0;
> for (String fieldName : fieldNames) {
>   int freq = ir.docFreq(new Term(fieldName, word));
>   topField = (freq > docFreq) ? fieldName : topField;
>   docFreq = (freq > docFreq) ? freq : docFreq;
> }
> ...
> We identify the topField as the field with the highest document frequency for the term t.
> Then we build the termQuery :
> queue.add(new ScoreTerm(word, topField, score, idf, docFreq, tf));
> In this way we lose a lot of precision.
> Not sure why we do that.
> I would prefer to keep the relation between terms and fields.
> That could improve the quality of the MLT query a lot.
> For example, if I run the MLT on 2 fields: weSell and weDontSell.
> It is likely I want to find documents with similar terms in weSell and similar terms
> in weDontSell, without mixing things up and losing the semantics of the terms.
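
The field separation asked for in the description above can be sketched roughly as follows. This is an illustration only, not the attached patch: it uses plain {{Integer}} counts instead of the {{Int}} type shown in the quoted snippet, and the class name, the {{addTerm}} helper and the sample term are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class PerFieldTermFrequencies {

    // Records one occurrence of `word` in `field`, keeping a separate
    // word -> frequency map per field so the field/term relation is not lost.
    static void addTerm(Map<String, Map<String, Integer>> perFieldTermFrequencies,
                        String field, String word) {
        perFieldTermFrequencies
            .computeIfAbsent(field, f -> new HashMap<>())
            .merge(word, 1, Integer::sum);
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> perField = new HashMap<>();
        addTerm(perField, "weSell", "laptop");
        addTerm(perField, "weSell", "laptop");
        addTerm(perField, "weDontSell", "laptop");

        // The same word keeps an independent frequency in each field.
        for (Map.Entry<String, Map<String, Integer>> fieldEntry : perField.entrySet()) {
            for (Map.Entry<String, Integer> wordEntry : fieldEntry.getValue().entrySet()) {
                System.out.println(fieldEntry.getKey() + ":" + wordEntry.getKey()
                    + " tf=" + wordEntry.getValue());
            }
        }
    }
}
```

With a structure like this, a term seen in weSell is never merged with the same term seen in weDontSell, so each query term can stay attached to the field it was extracted from.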



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

