mahout-user mailing list archives

From: Ken Krugler <kkrugler_li...@transpac.com>
Subject: Re: Fuzzy matching
Date: Mon, 02 May 2011 16:21:25 GMT
Hi James,

What Soren said :)

The approach I generally use is to create a key with sufficiently broad scope to cast a big
enough net (e.g. by zip code, or by zip + last "real" part of street address).
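
Just as an illustrative sketch (not code from any real job), building that kind of
blocking key in a plain Hadoop mapper might look something like the following. The
tab-separated input layout, field order, and suffix/direction lists are all assumptions:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (blocking key, raw record), where the key is zip + last "real" street token.
// Field layout is a made-up example: id \t street \t city \t zip
public class BlockingKeyMapper extends Mapper<LongWritable, Text, Text, Text> {

  // Hypothetical street-type suffixes to skip when looking for the last "real" token.
  private static final String SUFFIXES = "st|ave|blvd|rd|dr|ln|ct|way";
  private static final String DIRECTIONS = "[nsew]";

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t");
    String street = fields[1].toLowerCase();
    String zip = fields[3];

    // Walk backwards past street-type and direction tokens to the last "real" word.
    String[] tokens = street.replaceAll("[^a-z0-9 ]", "").split("\\s+");
    String lastReal = "";
    for (int i = tokens.length - 1; i >= 0; i--) {
      if (!tokens[i].matches(SUFFIXES) && !tokens[i].matches(DIRECTIONS)) {
        lastReal = tokens[i];
        break;
      }
    }
    context.write(new Text(zip + "|" + lastReal), line);
  }
}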

With Cascading, it's easy to do a self-join on this key to get all of the n x n combinations
of entities that share it. That way I don't have to worry about the memory cost of processing
all of a key's values at once, though it does preclude optimizations such as bailing out of
the detailed analysis once a combination's score can't be high enough to make the top
N.
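
If you're not on Cascading, the same pairing step can be sketched as a plain Hadoop
reducer that emits every pair of records sharing the blocking key. Note that this
simplified version does buffer each block in memory, which is exactly what the
Cascading self-join lets me avoid:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each blocking key, emit all unordered pairs of records that share it,
// so a later step can score each candidate pair.
public class PairUpReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text blockingKey, Iterable<Text> records, Context context)
      throws IOException, InterruptedException {
    // Copy values out (Hadoop reuses the Text instance across the iterator).
    List<String> buffered = new ArrayList<String>();
    for (Text record : records) {
      buffered.add(record.toString());
    }
    for (int i = 0; i < buffered.size(); i++) {
      for (int j = i + 1; j < buffered.size(); j++) {
        context.write(new Text(buffered.get(i)), new Text(buffered.get(j)));
      }
    }
  }
}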

For each of the combinations, generate a score based on heuristics (that's the painful part
Soren mentioned).

E.g. I'll calculate scores for the degree of similarity between street number, street name,
city name, phone number, and other fields.

With appropriate weighting for each of these factors, you can get pretty good results.
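
Purely to illustrate the weighting idea (the fields, weights, and edit-distance
similarity below are placeholders, not anything tuned on real data):

// Combines per-field similarities into one weighted score in [0, 1].
public class AddressScorer {

  // Normalized similarity based on Levenshtein edit distance.
  static double similarity(String a, String b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 1.0;
    }
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                           d[i - 1][j - 1] + cost);
      }
    }
    int maxLen = Math.max(a.length(), b.length());
    return 1.0 - (double) d[a.length()][b.length()] / maxLen;
  }

  // left/right are parallel field arrays: street number, street name, city, phone.
  static double score(String[] left, String[] right) {
    double[] weights = {0.3, 0.4, 0.2, 0.1};  // made-up weights, sum to 1.0
    double total = 0.0;
    for (int f = 0; f < weights.length; f++) {
      total += weights[f] * similarity(left[f], right[f]);
    }
    return total;
  }
}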

Then group by key, sort by score, and pick the top result. If the top N have very close
scores, you can emit them for manual processing.
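
As a minimal sketch of that selection step (reviewMargin, e.g. 0.05, is a made-up
threshold for what counts as "very close"):

import java.util.Collections;
import java.util.List;

// Picks the highest-scoring candidate for a key, flagging near-ties for review.
public class BestMatchPicker {

  public static class Scored implements Comparable<Scored> {
    final String candidate;
    final double score;

    Scored(String candidate, double score) {
      this.candidate = candidate;
      this.score = score;
    }

    public int compareTo(Scored other) {
      return Double.compare(other.score, this.score);  // sort descending by score
    }
  }

  // candidates is assumed non-empty: every key has at least one scored match.
  static Scored pickBest(List<Scored> candidates, double reviewMargin) {
    Collections.sort(candidates);
    Scored best = candidates.get(0);
    if (candidates.size() > 1
        && best.score - candidates.get(1).score < reviewMargin) {
      System.out.println("Near-tie for " + best.candidate + " - route to manual review");
    }
    return best;
  }
}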

-- Ken

On May 2, 2011, at 8:59am, Soren Flexner wrote:

>  James,
> 
>  If you have names along with those addresses, the top-level (blocking)
> approach I would take would be to run through the whole set and create a
> group of tags using, say, last name + house number + zip.  Then you could
> write an MR job where you make those tags the key, and your 'people' will
> automatically be sorted together prior to being sent to the reducers.
> Attempting de-duplication beyond blocking invariably requires fairly
> sophisticated feature-vector creation and machine-learning algorithms to
> get much more mileage.
> 
>  The problem with using any other part of the address (or for that matter
> the first name) is normalization.
> 
>  For example, you'll often see incorrect street-direction characters at the
> beginning (or end) of street names (N. Franklin, when it's really Franklin
> N., etc.).
> 
>  Sometimes streets have multiple names.
> 
>  A lot of the time, apartment or building letters get mixed into the
> beginning or end of street names (e.g. 1100 A Franklin St N).
> 
>  In the city field, you'll see Santa Monica, when it's really LA, or the
> other way around.
> 
>  For names, you'll see Bill instead of William, etc.
> 
>  These problems are quite significant in user-generated data.
> 
>  Make the decision early on as to whether you want to get 80-90% (just use
> blocking) or whether you need better precision/recall than that.  If you
> decide to go whole hog, that last 10-20% is a monster, so just be ready for
> that.
> 
>  -s
> 
> On Fri, Apr 29, 2011 at 3:30 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> 
>> If these are street addresses, brute force will probably get you 99% of the
>> way.
>> 
>> The traditional dedupe for addresses is to take the number plus the first
>> letter of the street name within a city (for smaller cities) or zip code
>> region (possibly zip-4, possibly zip-5).  Sort by this extracted key and do
>> detailed matching on the matches.
>> 
>> From there, standard edit distance with an ever growing pile of heuristics
>> works pretty well.  I have at times built fancy models to mark duplicates
>> but the supply of truly weird special cases that defy modeling seems
>> infinite so that is a pretty unrewarding activity.  You can get the error
>> rate down to a few percent or less pretty quickly using a cycle of sort,
>> apply heuristics, inspect manually, build new heuristics, lather, rinse and
>> repeat.
>> 
>> This process is made to order for map-reduce, but your data is actually not
>> that large for the sorting part even in a conventional database.  Your data
>> is large enough and the map-reduce will be fast enough that using Hadoop
>> may
>> work out well.  Using Hive or Pig would make that much easier especially
>> when you start injecting your infinite supply of heuristic adjustments.
>> 
>> If you really want to do a full everything to everything fuzzy compare, try
>> extracting short sub-strings of 2-4 characters and compute a fuzzy match
>> score based on sub-string counts. Since the address to substring relation
>> is
>> a sparse one, you can do this computation using map-reduce without the full
>> n^2 cost.  You will need to throw away the most common substrings to avoid
>> quadratic explosion, but that doesn't hurt accuracy.
>> 
>> If you meant something else, please holler!
>> 
>> On Fri, Apr 29, 2011 at 12:50 PM, James Pettyjohn <jamesp@scientology.net>
>> wrote:
>> 
>>> 
>>> 
>>> Hey,
>>> 
>>> First time writing in.
>>> 
>>> I have around 6 million active records in a contacts database, plus
>>> additional millions of historical address records for them. I've got 60+
>>> thousand new records, not correlated to these, that I need to fuzzy match
>>> against both the active and historical records.
>>> 
>>> I will need to do the same thing with the database against itself for
>>> de-duplication later. The data is primarily in Oracle (with the supplement
>>> in CSVs).
>>> 
>>> I saw the Booz Allen Hamilton presentation on fuzzy matching, but I don't
>>> see any distributions for that implementation. At the same time, I don't
>>> need real-time queries - just batch processing at the moment.
>>> 
>>> I thought Mahout might fit the bill. Any comments on approach
>>> would be appreciated.
>>> 
>>> Best, James
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





