mahout-user mailing list archives

From Manuel Blechschmidt
Subject Re: Any Entity Resolution & Deduplication solution?
Date Tue, 03 Dec 2013 19:28:12 GMT
Hi Jason,
Mahout does not have any built-in duplicate detection capabilities.

My former university provides a duplicate detection library (DuDe):

If you want to tag entities, you might want to look into GATE.
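
To give a rough idea of what a rule-based first pass could look like for the names quoted below, here is a minimal sketch in Python. It is not how DuDe or GATE work. The suffix list, the generic word list, and the alias table are hypothetical placeholders, and the alias table assumes you already know that 华为 and Huawei refer to the same company.

```python
import re
from collections import defaultdict

# Hypothetical lists for illustration only; real ones would be far larger.
LEGAL_SUFFIXES = ["co. ltd", "co., ltd", "ltd", "有限公司"]
GENERIC_WORDS = ["technologies"]
ALIASES = {"华为": "huawei"}  # Chinese name -> canonical key (assumed known)


def normalize(name):
    """Reduce a raw company name to a crude canonical key."""
    key = name.strip().lower()
    # Strip one trailing legal form, trying the longest suffix first.
    for suffix in sorted(LEGAL_SUFFIXES, key=len, reverse=True):
        if key.endswith(suffix):
            key = key[: -len(suffix)].strip()
            break
    # Drop generic descriptor words that do not identify the company.
    for word in GENERIC_WORDS:
        key = re.sub(r"\b" + re.escape(word) + r"\b", " ", key)
    # Collapse whitespace and stray punctuation.
    key = re.sub(r"[\s.,]+", " ", key).strip()
    # Map known aliases (for example Chinese names) to one key.
    return ALIASES.get(key, key)


def group_duplicates(names):
    """Group raw names that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)
```

Calling group_duplicates on the four Huawei variants from the quoted message puts them all under the key "huawei". Exact-key grouping like this only catches variants the rules anticipate; misspellings and unknown aliases would still need a similarity measure or a dedicated duplicate detection tool.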

Hope that helps

On 03.12.2013, at 09:41, Jason Lee wrote:

> I have 10M+ textual company names (in Chinese) that were extracted from the
> work experience in our users' profiles. Because those company names are
> entered manually by users of our site, there are lots of duplicates. Our goal
> is to extract and cleanse this data to establish a company dictionary. For
> example, these terms should be considered one company:
> Huawei Technologies Co. Ltd
> Huawei
> 华为 -> (华为 is Huawei in Chinese)
> 华为有限公司 -> (有限公司 is Co. Ltd in Chinese)
> It looks like a clustering process, but I don't have any idea how I can
> implement it.
> Regards.
> - Jason

Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
