mahout-user mailing list archives

From Jason Lee <wua...@gmail.com>
Subject Re: Any Entity Resolution & Deduplication solution?
Date Fri, 06 Dec 2013 10:27:42 GMT
Hi Suneel, Manuel,

Thank you so much for your advice, and sorry for my late reply.

I will try out the methods and the library you mentioned and let you
know how it goes, along with any questions that come up along the way.


On Wed, Dec 4, 2013 at 3:28 AM, Manuel Blechschmidt <
Manuel.Blechschmidt@gmx.de> wrote:

> Hi Jason,
> Mahout does not have any direct duplicate detection capabilities.
>
> My former university provides a duplicate detection library (DuDe):
>
> http://www.hpi.uni-potsdam.de/naumann/projekte/dude_duplicate_detection.html
>
> If you want to tag entities you might want to look into GATE.
> http://gate.ac.uk/sale/talks/stupidpoint/diana-fb.ppt
>
>
> Hope that helps
>     Manuel
>
> On 03.12.2013, at 09:41, Jason Lee wrote:
>
> > I have 10M+ textual company names (in Chinese) that were extracted from
> > the work experience sections of users' profiles. Because those company
> > names were entered manually by users of our site, there are lots of
> > duplicates. Our goal is to extract and cleanse this data to build a
> > company dictionary. For example, these terms should be considered one
> > company:
> >
> > Huawei Technologies Co. Ltd
> > Huawei
> > huawei.com
> > 华为                        ->  (华为 is Huawei in Chinese)
> > 华为有限公司                 ->  (有限公司 is Co. Ltd in Chinese)
> >
> > This looks like a clustering process, but I don't have any idea how to
> > implement it.
> >
> > Regards.
> > - Jason
>
> --
> Manuel Blechschmidt
> M.Sc. IT Systems Engineering
> Dortustr. 57
> 14467 Potsdam
> Mobil: 0173/6322621
> Twitter: http://twitter.com/Manuel_B
>
>
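The grouping described in the quoted question can be sketched as a normalize-then-group pass. This is only a minimal Python sketch and is not something Mahout, DuDe, or GATE provides. It groups names by an exact normalized key, which is a simpler technique than similarity-based clustering. The suffix patterns, the GENERIC_WORDS set, the ALIASES table, and the function names are all hypothetical and would need to be much larger and better curated for 10M+ real names.

```python
import re
import unicodedata
from collections import defaultdict

# All of the tables below are illustrative assumptions, not a real dictionary.
DOMAIN_SUFFIX = re.compile(r"\.(com|cn|net|org)$")
LATIN_SUFFIX = re.compile(r"[\s,]+(co\.?,?\s*ltd|ltd|inc)\.?$")
CJK_SUFFIX = re.compile(r"(有限公司|公司)$")
GENERIC_WORDS = {"technologies", "technology"}
ALIASES = {"华为": "huawei"}  # assumed Chinese-to-Latin alias table


def normalize(name: str) -> str:
    """Reduce a raw company name to a comparison key."""
    s = unicodedata.normalize("NFKC", name).strip().lower()
    s = DOMAIN_SUFFIX.sub("", s)   # huawei.com -> huawei
    s = LATIN_SUFFIX.sub("", s)    # "... co. ltd" -> "..."
    s = CJK_SUFFIX.sub("", s)      # 华为有限公司 -> 华为
    # Drop generic descriptive words that do not identify the company.
    s = " ".join(t for t in s.split() if t not in GENERIC_WORDS)
    # Map known aliases onto one canonical spelling.
    return ALIASES.get(s, s)


def group_by_key(names):
    """Group raw names that share the same normalized key."""
    clusters = defaultdict(list)
    for name in names:
        clusters[normalize(name)].append(name)
    return dict(clusters)


if __name__ == "__main__":
    sample = [
        "Huawei Technologies Co. Ltd",
        "Huawei",
        "huawei.com",
        "华为",
        "华为有限公司",
    ]
    for key, members in group_by_key(sample).items():
        print(key, members)
```

Exact-key grouping only merges names whose differences the rules anticipate. Typos and unseen abbreviations would still need a fuzzy comparison step within or across the groups.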
