mahout-user mailing list archives

From Jason Rutherglen <jason.rutherg...@gmail.com>
Subject Re: Finding the similarity of documents using Mahout for deduplication
Date Tue, 21 Jul 2009 01:20:27 GMT
How is the hash calculated?

On Mon, Jul 20, 2009 at 1:41 AM, Shashikant Kore<shashikant@gmail.com> wrote:
> You may read about Google's approach for near-duplicates.
>
> http://www2007.org/papers/paper215.pdf
>
> The idea here is to reduce the entire document to a 64-bit sketch by
> dimension reduction and then compare the sketches of two documents to
> find near-duplicates. The key property of the sketch is that similar
> documents produce similar sketches. So, if the sketches of two
> documents differ in at most k bits, they are near-duplicates. In their
> experiments, they found that k=3 yields the best results.
>
> --shashi
>
> On Sat, Jul 18, 2009 at 12:56 AM, Jason
> Rutherglen<jason.rutherglen@gmail.com> wrote:
>> I think this comes up fairly often in search apps: duplicate
>> documents get indexed (for example, SimplyHired's search lists
>> 20 copies of the same job from different websites). A
>> similarity score above a threshold would indicate that the
>> documents are duplicates and can therefore be removed.
>> Is there a recommended Mahout algorithm for this?
>>
>
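
On the question of how the hash is calculated: the quoted message does not say, but sketches of this kind are commonly built with a simhash-style procedure. The following is a rough Python sketch of that idea, not code from the paper or from Mahout. The whitespace tokenizer, the use of term counts as weights, the truncated MD5 token hash, and all function names are assumptions made for illustration. Each token is hashed to 64 bits; every bit position collects +weight for a 1 bit and -weight for a 0 bit; the sign of each total gives the corresponding bit of the sketch.

```python
import hashlib
from collections import Counter

SKETCH_BITS = 64  # sketch width mentioned in the quoted message


def token_hash(token: str) -> int:
    """Hash a token to 64 bits (assumption: first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """Build a 64-bit sketch from lowercase whitespace tokens weighted by count."""
    weights = Counter(text.lower().split())
    totals = [0] * SKETCH_BITS
    for token, weight in weights.items():
        h = token_hash(token)
        for i in range(SKETCH_BITS):
            # A 1 bit votes +weight for this position, a 0 bit votes -weight.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    sketch = 0
    for i, total in enumerate(totals):
        if total > 0:
            sketch |= 1 << i
    return sketch


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two sketches differ."""
    return bin(a ^ b).count("1")


def near_duplicates(doc_a: str, doc_b: str, k: int = 3) -> bool:
    """Treat two documents as near-duplicates if their sketches differ in at most k bits."""
    return hamming_distance(simhash(doc_a), simhash(doc_b)) <= k
```

Calling near_duplicates(text_a, text_b) compares two documents with the k=3 threshold from the quoted message. Very short texts have few tokens voting on each bit, so a single changed word can flip more than k bits; the threshold is only meaningful for documents of realistic length and with a feature weighting suited to the corpus.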
