mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Fuzzy matching
Date Sun, 01 May 2011 06:34:36 GMT
Interesting point.  I hadn't noticed.

On the other hand, if they get their deduping in order, maybe we won't get
as much duplicated junk mail from them.

On Sat, Apr 30, 2011 at 11:21 PM, Patrick Collins <patrick.collins@ready2sign.com> wrote:

> Should I be worried that somebody with a scientology.net email address is
> writing in about address harvesting and data deduping?
>
> Patrick.
>
> On Fri, Apr 29, 2011 at 12:50 PM, James Pettyjohn <jamesp@scientology.net> wrote:
>
> >
> >
> > Hey,
> >
> > First time writing in.
> >
> > I have around 6 million active records in a contacts database. Additional
> > millions of history address records for these records. I got a new 60+
> > thousand records which are not correlated to these that I need to fuzzy
> > match against both active and historical records.
> >
> > I will need to do the same thing with the database against itself for
> > de-duplication later. The data is primarily in Oracle (with the
> > supplement in csv's).
> >
> > I saw the Booz/Allen/Hamilton presentation on fuzzy matching - but I
> > don't see any distributions for that implementation. At the same time I
> > don't need real time query - just batch processing at the moment.
> >
> > I thought Mahout might fit the bill. Any comments on approach
> > would be appreciated.
> >
> > Best, James
>
