lucene-solr-user mailing list archives

From Amit Jha <shanuu....@gmail.com>
Subject De Duplication using Solr
Date Sat, 03 Jan 2015 07:54:17 GMT
I am trying to find duplicate records using distance and phonetic
algorithms. Can I use Solr for that? I have the following fields and
conditions for identifying exact or possible duplicates.

1. Fields
prefix
suffix
firstname
lastname
email(primary_email1, email2, email3)
phone(primary_phone1, phone2, phone3)
2. Conditions:
Two records are said to be exact duplicates if

1. IsExactMatchFunction(record1_prefix, record2_prefix) AND
IsExactMatchFunction(record1_suffix, record2_suffix) AND
IsExactMatchFunction(record1_firstname,record2_firstname) AND
IsExactMatchFunction(record1_lastname,record2_lastname) AND
IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
IsExactMatchFunction(record1_primary_phone,record2_primary_phone)
Two records are said to be possible duplicates if

1. IsExactMatchFunction(record1_prefix, record2_prefix) OR
IsExactMatchFunction(record1_suffix, record2_suffix) OR
IsExactMatchFunction(record1_firstname,record2_firstname) AND
IsExactMatchFunction(record1_lastname,record2_lastname) AND
IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
IsExactMatchFunction(record1_primary_phone,record2_primary_phone)
 ELSE
 2. IsFuzzyMatchFunction(record1_firstname,record2_firstname) AND
IsExactMatchFunction(record1_lastname,record2_lastname) AND
IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
IsExactMatchFunction(record1_primary_phone,record2_primary_phone)
 ELSE
 3. IsFuzzyMatchFunction(record1_firstname,record2_firstname) AND
IsExactMatchFunction(record1_lastname,record2_lastname) AND
IsExactMatchFunction(record1_any_email,record2_any_email) OR
IsExactMatchFunction(record1_any_phone,record2_any_phone)
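
The conditions above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the Record class and all helper names are hypothetical, exact matching is assumed to be case-insensitive string equality, the fuzzy check is a difflib placeholder, and the AND/OR grouping (name conditions AND (email OR phone)) is one possible reading of the rules, since the original leaves the precedence open.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable, List


@dataclass
class Record:  # hypothetical container for the fields listed above
    prefix: str = ""
    suffix: str = ""
    firstname: str = ""
    lastname: str = ""
    emails: List[str] = field(default_factory=list)  # emails[0] is the primary
    phones: List[str] = field(default_factory=list)  # phones[0] is the primary


def norm(value: str) -> str:
    return value.strip().lower()


def exact(a: str, b: str) -> bool:
    """Stand-in for IsExactMatchFunction (assumed: case-insensitive equality)."""
    return norm(a) == norm(b)


def default_fuzzy(a: str, b: str, threshold: float = 85.0) -> bool:
    """Placeholder for IsFuzzyMatchFunction based on difflib's ratio (an assumption)."""
    return SequenceMatcher(None, norm(a), norm(b)).ratio() * 100 >= threshold


def primary_contact_match(r1: Record, r2: Record) -> bool:
    # Primary email OR primary phone must match; a missing value never matches.
    email = bool(r1.emails and r2.emails) and exact(r1.emails[0], r2.emails[0])
    phone = bool(r1.phones and r2.phones) and exact(r1.phones[0], r2.phones[0])
    return email or phone


def any_contact_match(r1: Record, r2: Record) -> bool:
    # Any email OR any phone shared between the two records.
    emails = {norm(e) for e in r1.emails} & {norm(e) for e in r2.emails}
    phones = {norm(p) for p in r1.phones} & {norm(p) for p in r2.phones}
    return bool(emails or phones)


def classify(r1: Record, r2: Record,
             fuzzy: Callable[[str, str], bool] = default_fuzzy) -> str:
    """Return 'exact', 'possible', or 'distinct' for a pair of records."""
    same_prefix = exact(r1.prefix, r2.prefix)
    same_suffix = exact(r1.suffix, r2.suffix)
    same_first = exact(r1.firstname, r2.firstname)
    same_last = exact(r1.lastname, r2.lastname)
    primary = primary_contact_match(r1, r2)

    # Exact duplicate: every name part matches and a primary contact matches.
    if same_prefix and same_suffix and same_first and same_last and primary:
        return "exact"
    # Possible duplicate, rule 1 (assumed grouping of the OR conditions).
    if (same_prefix or same_suffix or same_first) and same_last and primary:
        return "possible"
    fuzzy_first = fuzzy(r1.firstname, r2.firstname)
    # Rule 2: fuzzy first name, exact last name, primary contact.
    if fuzzy_first and same_last and primary:
        return "possible"
    # Rule 3: fuzzy first name, exact last name, any contact.
    if fuzzy_first and same_last and any_contact_match(r1, r2):
        return "possible"
    return "distinct"
```

Passing the fuzzy check in as a parameter keeps the rule cascade separate from whichever distance or phonetic algorithm ends up being used.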

IsFuzzyMatchFunction() will run the distance and phonetic algorithms
and compare the resulting score with a predefined threshold.

For example:

If the threshold defined for firstname is 85, IsFuzzyMatchFunction()
returns "true" if and only if one of the algorithms (distance or
phonetic) returns a similarity score >= 85.
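
One way the threshold check described above might look in Python is sketched below. The choice of algorithms is an assumption, not something stated in the message: Levenshtein distance normalized to a 0-100 score stands in for the distance algorithm, and a basic American Soundex stands in for the phonetic algorithm, scoring 100 when the codes are equal and 0 otherwise. All function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def distance_score(a: str, b: str) -> float:
    """Edit distance normalized to a 0-100 similarity (assumed scoring)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


_CODES = {ch: digit
          for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                 ("l", "4"), ("mn", "5"), ("r", "6"))
          for ch in letters}


def soundex(name: str) -> str:
    """Basic American Soundex: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != last:
            out += digit
        if ch not in "hw":  # h and w do not separate repeated codes
            last = digit
    return (out + "000")[:4]


def phonetic_score(a: str, b: str) -> float:
    """Assumed scoring: 100 if the Soundex codes agree, otherwise 0."""
    code_a, code_b = soundex(a), soundex(b)
    return 100.0 if code_a and code_a == code_b else 0.0


def is_fuzzy_match(a: str, b: str, threshold: float = 85.0) -> bool:
    """True if either the distance or the phonetic score reaches the threshold."""
    return max(distance_score(a, b), phonetic_score(a, b)) >= threshold
```

With a threshold of 85, a pair like "Smith" and "Smyth" would fail on edit distance alone but pass through the phonetic branch, which is the "one of the algorithms" behaviour described above.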

Can I use Solr to perform this job? Or can you suggest how I should
approach this problem? I have seen Duke (a deduplication API), but I
cannot use Duke out of the box.
