mahout-user mailing list archives

From Suneel Marthi <suneel_mar...@yahoo.com>
Subject Re: Any Entity Resolution & Deduplication solution?
Date Tue, 03 Dec 2013 12:32:00 GMT
You don't need clustering for this.

Lucene should be able to help you build a dictionary here. Look at:

a) Lucene's CJK and Standard Analyzers
b) Mahout's DictionaryVectorizer (with an appropriate Lucene Analyzer)

Those, along with an appropriate choice of n-grams and stopwords, should do
it for you.
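
To make the n-grams and stopwords idea concrete, here is a rough sketch in
plain Python. It is not the Lucene or Mahout API; the stopword lists and the
function names (normalize, char_ngrams) are made up for illustration, and a
real list of legal-form suffixes would be much longer.

```python
import re
import unicodedata

# Hypothetical stopword lists (assumptions, not taken from Lucene or Mahout).
LATIN_STOPWORDS = {"technologies", "co", "ltd", "com", "www"}
CJK_STOPWORDS = ["有限公司", "公司"]  # longest first, so the longer suffix wins


def normalize(name):
    """Lowercase, fold full-width characters, and drop stopwords."""
    text = unicodedata.normalize("NFKC", name).lower()
    # Keep runs of Latin letters/digits and runs of CJK ideographs.
    tokens = re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]+", text)
    kept = []
    for token in tokens:
        if token in LATIN_STOPWORDS:
            continue
        for stop in CJK_STOPWORDS:
            token = token.replace(stop, "")
        if token:
            kept.append(token)
    return " ".join(kept)


def char_ngrams(text, n=2):
    """Return the set of character n-grams of the text, ignoring spaces."""
    compact = text.replace(" ", "")
    if len(compact) <= n:
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


if __name__ == "__main__":
    for raw in ["Huawei Technologies Co. Ltd", "huawei.com", "华为有限公司"]:
        key = normalize(raw)
        print(raw, "->", key, sorted(char_ngrams(key)))
```

With these lists, "Huawei Technologies Co. Ltd" and "huawei.com" both reduce
to the same key, and so do "华为" and "华为有限公司". This alone will not tie the
Latin form to the Chinese one; that would need something extra, such as an
alias table.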

On Tuesday, December 3, 2013 3:41 AM, Jason Lee <wuaner@gmail.com> wrote:
 
I have 10M+ company names (in Chinese) that were extracted from the work
experience sections of user profiles. Because those company names were
entered manually by users of our site, there are lots of duplicates. Our goal
is to extract and cleanse this data to build a company dictionary. For
example, these terms should be considered one company:

Huawei Technologies Co. Ltd
Huawei
huawei.com
华为                        ->  (华为 is Huawei in Chinese)
华为有限公司                         -> (有限公司 is Co. Ltd in Chinese)

It looks like a clustering problem, but I don't have any idea how to
implement it.

Regards.
- Jason