lucene-dev mailing list archives

From "Andy Hind (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6968) LSH Filter
Date Tue, 08 Jan 2019 14:51:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737192#comment-16737192 ]

Andy Hind commented on LUCENE-6968:
-----------------------------------

[~mayyas] Hi Mayya, there is a good review paper here: [https://arxiv.org/pdf/1408.2927.pdf].
See sections 3.5.1 and 3.5.2 and the related references. I have not been able to find the
specific comment about bias that I was trying to locate.

The hand-waving view is that empty or missing hashes introduce bias in many-to-many comparisons.
It is difficult to tune the hash parameters for a wide mix of doc sizes, and for small documents
in particular, since the number of hashes increases with doc size over some range. It is better
to have some value rather than none. There is an argument about which value should be used,
but that is less important. Repetition is one way of filling in the gaps and making the hash count
consistent. For two small docs, there is going to be a bit of asymmetry in the measure whatever
you do. In some cases, like containment, the bias may be a good thing :)
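
To make the repetition idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the code in the attached patches: the function names, the SHA-256 based token hash, and the "copy from the next non-empty bucket, wrapping around" rule are all hypothetical choices made for this example. Each token is hashed once, the hash range is split into a fixed number of buckets, and the minimum hash per bucket is kept. A small document leaves many buckets empty, and repetition fills them so that every document ends up with the same hash count.

```python
import hashlib


def token_hash(token):
    """Stable 32-bit hash of a token (an illustrative choice, not Lucene's)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def bucketed_min_hashes(tokens, bucket_count):
    """Keep the minimum hash seen in each bucket; None marks an empty bucket."""
    bucket_width = (1 << 32) // bucket_count
    buckets = [None] * bucket_count
    for token in set(tokens):
        h = token_hash(token)
        index = min(h // bucket_width, bucket_count - 1)
        if buckets[index] is None or h < buckets[index]:
            buckets[index] = h
    return buckets


def fill_by_repetition(buckets):
    """Fill each empty bucket from the next non-empty bucket, wrapping around."""
    if all(value is None for value in buckets):
        return list(buckets)  # nothing to repeat
    n = len(buckets)
    filled = list(buckets)
    for i in range(n):
        j = i
        while filled[i] is None:
            j = (j + 1) % n
            filled[i] = buckets[j]
    return filled


# A 7-token document cannot populate 16 buckets on its own.
small_doc = "Solr is an open source search engine".split()
sparse = bucketed_min_hashes(small_doc, bucket_count=16)
dense = fill_by_repetition(sparse)

print(sum(value is None for value in sparse), "empty buckets before filling")
print(sum(value is None for value in dense), "empty buckets after filling")
print(fill_by_repetition([None, 5, None, 7]))  # -> [5, 5, 7, 7]
```

Which value goes into an empty bucket is the part that is open to argument. The sketch only shows that some consistent rule gives every document the same number of hashes to compare.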

Apologies for my slow response.

> LSH Filter
> ----------
>
>                 Key: LUCENE-6968
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6968
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Cao Manh Dat
>            Assignee: Tommaso Teofili
>            Priority: Major
>             Fix For: 6.2, 7.0
>
>         Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java
> {quote}
> We want to find documents that have a Jaccard similarity score of 0.6 with this doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).
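
For reference, the exact Jaccard scores for the example in the issue description can be computed directly. This is a minimal sketch that assumes lowercase alphanumeric word tokens with no shingling or stemming; the helper names `word_set` and `jaccard` are hypothetical, and the analysis chain used by the actual filter may tokenize differently.

```python
import re


def word_set(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


corpus = [
    "Solr is an open source search engine based on Lucene",
    "Solr is an open source enterprise search engine based on Lucene",
    "Solr is an popular open source enterprise search engine based on Lucene",
    "Apache Lucene is a high-performance, full-featured text search engine "
    "library written entirely in Java",
]
query = "Solr is an open source search engine"

scores = [jaccard(word_set(query), word_set(doc)) for doc in corpus]
for number, score in enumerate(scores, start=1):
    print(f"doc {number}: {score:.2f}")
```

With this tokenization the scores are 0.70, 0.64, 0.58 and 0.15, so docs 1 to 3 sit close to the 0.6 threshold and doc 4 is far below it. Doc 3 lands just under 0.6 here, so whether it is returned depends on the tokenization (for example, shingles instead of single words) and on the estimation error of the hashing.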



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

