mahout-user mailing list archives

From Manuel Blechschmidt <Manuel.Blechschm...@gmx.de>
Subject Re: Any Entity Resolution & Deduplication solution?
Date Tue, 03 Dec 2013 19:28:12 GMT
Hi Jason,
Mahout does not have any built-in duplicate detection capabilities.

My former university provides a duplicate detection library (DuDe):
http://www.hpi.uni-potsdam.de/naumann/projekte/dude_duplicate_detection.html

If you want to tag entities you might want to look into GATE.
http://gate.ac.uk/sale/talks/stupidpoint/diana-fb.ppt
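
To give a rough idea of what a duplicate detection pass does with names like
yours, here is a minimal sketch in Python. It is not the API of DuDe, GATE, or
Mahout. The suffix list, the alias table, and the function names are all
assumptions made up for illustration; real data would need much longer lists
and probably a fuzzy similarity measure on top.

```python
import re
from collections import defaultdict

# Hypothetical noise suffixes to strip (assumption, not an exhaustive list).
# Longer suffixes come first so they are tried before shorter ones.
NOISE_SUFFIXES = ["technologies co. ltd", "co. ltd", "有限公司"]

# Hypothetical alias table mapping a Chinese name to a canonical key.
ALIASES = {"华为": "huawei"}


def normalize(name):
    """Reduce a raw company name to a canonical grouping key."""
    key = name.strip().lower()
    # Drop a trailing domain ending such as ".com".
    key = re.sub(r"\.(com|cn|net)$", "", key)
    # Strip the first matching noise suffix, if any.
    for suffix in NOISE_SUFFIXES:
        if key.endswith(suffix):
            key = key[: -len(suffix)].strip()
            break
    # Map known aliases onto one key; unknown names keep their own key.
    return ALIASES.get(key, key)


def group_names(names):
    """Group raw names that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)
```

Calling group_names() on the five example names from your mail puts all of
them under the single key "huawei". Exact-key grouping like this only catches
variants you have rules for, so it is usually just the first step before a
pairwise similarity comparison within each group.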


Hope that helps
    Manuel

On 03.12.2013, at 09:41, Jason Lee wrote:

> I have 10M+ textual company names (in Chinese) that were extracted from the
> work experience sections of our users' profiles. Because those company names
> are entered manually by users of our site, there are lots of duplicates. Our
> goal is to extract and cleanse this data to establish a company dictionary.
> For example, these terms should be considered one company:
> 
> Huawei Technologies Co. Ltd
> Huawei
> huawei.com
> 华为  ->  (华为 is Huawei in Chinese)
> 华为有限公司  ->  (有限公司 is Co. Ltd in Chinese)
> 
> It looks like a clustering process, but I don't have any idea how I can
> implement it.
> 
> Regards.
> - Jason

-- 
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B

