lucene-solr-user mailing list archives

From Dan Davis <dansm...@gmail.com>
Subject Re: Customzing Solr Dedupe
Date Wed, 01 Apr 2015 16:50:57 GMT
But you can potentially still use Solr dedupe if you do the upfront work
(in RDBMS or NoSQL pre-index processing) to assign some sort of "Group ID".
See OCLC's FRBR Work-Set Algorithm,
http://www.oclc.org/content/dam/research/activities/frbralgorithm/2009-08.pdf?urlm=161376
for some details on one such algorithm.
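
To illustrate why a pre-assigned group ID makes exact-signature dedupe sufficient, here is a minimal Python sketch. It is not Solr's implementation: a plain dict stands in for the index, MD5 is an arbitrary choice of hash, and the names signature, add_with_dedupe and groupId are assumptions made for the example.

```python
import hashlib

def signature(doc, fields):
    """Reduce the chosen fields to one hex digest (an exact-match signature)."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()

def add_with_dedupe(index, doc, fields=("groupId",)):
    """Store doc under its signature; a later doc with the same signature replaces it."""
    sig = signature(doc, fields)
    index[sig] = doc
    return sig

# A dict keyed by signature plays the role of the index (an assumption).
index = {}
add_with_dedupe(index, {"id": "1", "groupId": "g-17", "field3": "some text"})
add_with_dedupe(index, {"id": "2", "groupId": "g-17", "field3": "some text, slightly edited"})
add_with_dedupe(index, {"id": "3", "groupId": "g-42", "field3": "unrelated"})
```

Documents 1 and 2 share a group ID, so they reduce to the same signature and only the later one survives. The fuzzy part of the matching has already happened when the group ID was assigned.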

If the job is too big for an RDBMS, and/or you don't want to use (or don't
have) a suitable NoSQL store, you can have two Solr indexes
(collection/core/whatever): one for classification with only id, field1,
field2, field3, and another for production queries. Then you put stuff into
the classification index, use queries and your own algorithm to do the
classification, assigning a groupId, and then put the document with the
groupId assigned into the production index.
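
As a rough sketch of what "your own algorithm" could look like for the rule in the original question (field1 and field2 identical, 90% of the lines in field3 matching), something like the following Python might do. Everything here is an assumption for illustration: the groups dict stands in for a query against the classification index on field1 and field2, each group is represented by its first member only, and the overlap measure is one of several reasonable definitions of "90% of the lines match".

```python
def lines_of(text):
    """Non-empty, whitespace-trimmed lines of a text field."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]

def line_overlap(a, b):
    """Fraction of the lines in `a` that also occur in `b`."""
    a_lines = lines_of(a)
    if not a_lines:
        return 0.0
    b_set = set(lines_of(b))
    return sum(1 for ln in a_lines if ln in b_set) / len(a_lines)

def assign_group(doc, groups, threshold=0.9):
    """Return an existing groupId if doc matches a group, else create a new one.

    `groups` maps (field1, field2) to a list of (groupId, representative doc).
    """
    key = (doc["field1"], doc["field2"])  # exact match on field1 and field2
    for group_id, representative in groups.get(key, []):
        if line_overlap(doc["field3"], representative["field3"]) >= threshold:
            return group_id
    # No group matched: mint a new ID (hypothetical "g-N" scheme).
    group_id = "g-%d" % (sum(len(v) for v in groups.values()) + 1)
    groups.setdefault(key, []).append((group_id, doc))
    return group_id

base = "\n".join("line %d" % i for i in range(10))
near = "\n".join(["line %d" % i for i in range(9)] + ["changed"])

groups = {}
g1 = assign_group({"field1": "a", "field2": "b", "field3": base}, groups)
g2 = assign_group({"field1": "a", "field2": "b", "field3": near}, groups)
g3 = assign_group({"field1": "a", "field2": "x", "field3": base}, groups)
```

Here the second document shares 9 of its 10 lines with the first and lands in the same group, while the third differs in field2 and gets a new group. Note that the overlap is measured from the new document's side only, so the result can depend on arrival order.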

A key question is whether you want to preserve the groupId.   In some
cases, you do, and in some cases, it is just an internal signature.   In
both cases, a non-deterministic up-front algorithm can work, but if the
groupId needs to be preserved, you need to work harder to make sure it all
hangs together.

Hope this helps,

-Dan

On Wed, Apr 1, 2015 at 7:05 AM, Jack Krupansky <jack.krupansky@gmail.com>
wrote:

> Solr dedupe is based on the concept of a signature - some fields and rules
> that reduce a document into a discrete signature, and then checking if that
> signature exists as a document key that can be looked up quickly in the
> index. That's the conceptual basis. It is not based on any kind of field by
> field comparison to all existing documents.
>
> -- Jack Krupansky
>
> On Wed, Apr 1, 2015 at 6:35 AM, thakkar.aayush <thakkar.aayush@gmail.com>
> wrote:
>
> > I'm facing a challenge using de-duplication of Solr documents.
> >
> > De-duplication is done using TextProfileSignature with the following
> > parameters:
> > <str name="fields">field1, field2, field3</str>
> > <str name="quantRate">0.5</str>
> > <str name="minTokenLen">3</str>
> >
> > Here field3 is normal text with a few lines of data.
> > field1 and field2 can contain up to 5 or 6 words of data.
> >
> > I want to de-duplicate when the data in field1 and field2 are exactly the
> > same and 90% of the lines in field3 match those in another document.
> >
> > Is there any way to achieve this?
>
