mahout-user mailing list archives

From "Lin, Zhiwei" <zhiwei....@sap.com>
Subject Re: Word and Phrase Clustering
Date Fri, 02 Dec 2011 07:32:59 GMT
I suppose you could do this with a sequence similarity measure.
I know such a measure can be plugged into hierarchical clustering, but hierarchical
clustering does not seem to be part of Mahout yet.
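
To make "sequence similarity" concrete, here is a minimal sketch that assumes it means a normalized edit (Levenshtein) distance between two strings. That reading is an assumption, and the function names `levenshtein` and `similarity` are illustrative only; nothing here is Mahout API.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

A pairwise measure like this compares every entry with every other entry, so on its own it will not scale to a very large list; it would need some form of blocking or sampling first.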



----- Original Message -----
From: Neil Chaudhuri [mailto:nchaudhuri@potomacfusion.com]
Sent: Friday, December 02, 2011 05:48 AM
To: user@mahout.apache.org <user@mahout.apache.org>
Subject: Re: Word and Phrase Clustering

Glad to fill in more detail. Imagine I have a list of words and phrases in a data store like
this:

Alabama
Obama
University of Alabama
Bama
Potomac
Texas
Potomac River

I would like to cluster the ones that look similar enough to be the same, such as "Alabama,"
"University of Alabama," and "Bama" (but ideally not "Obama"), or "Potomac" and "Potomac River."


Now this list of words could be in the terabyte range, which is why I need distributed computing
capability.

How would I assemble a Vector from an individual entry in this list? With a bit more understanding
of my situation, do you think Mahout can work for me?
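
One possible encoding, offered as a sketch rather than anything Mahout prescribes: treat each entry as a bag of character trigrams and compare entries by cosine similarity. The trigram size, the padding with spaces, and the names `ngram_vector` and `cosine` are all assumptions made for illustration. A `Counter` stands in for a sparse vector here; in Mahout the same counts would have to be loaded into one of its sparse vector types.

```python
import math
from collections import Counter


def ngram_vector(text: str, n: int = 3) -> Counter:
    """Sparse vector of character n-gram counts for a single word or phrase."""
    padded = f" {text.lower().strip()} "  # pad so word boundaries form n-grams too
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(weight * v[gram] for gram, weight in u.items() if gram in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


entries = ["Alabama", "Obama", "University of Alabama", "Bama",
           "Potomac", "Texas", "Potomac River"]
vectors = {entry: ngram_vector(entry) for entry in entries}
```

With this encoding "Potomac" and "Potomac River" share most of their trigrams while "Potomac" and "Texas" share none. "Obama" still overlaps with "Alabama" through the shared "bama" ending, so keeping it out of that cluster would need a stricter threshold or extra features.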

Please let me know if I can provide more information.

Thanks.



On Dec 1, 2011, at 11:29 PM, Jeff Eastman wrote:

> Could you elaborate a bit on what you mean by "cluster a collection of 
> words and phrases by syntactic similarity over a distributed environment 
> "? If you can describe your collection in terms of a set of (sparse or 
> dense) term vectors then you should be able to use Mahout clustering 
> directly. The vectors do not need to be huge (as "document" might 
> imply), indeed smaller dimensionality clusterings work better than large 
> ones. The question would be how do you plan to encode these vectors? 
> Another would be how large a collection you have?
> 
> On 12/1/11 8:08 PM, Neil Chaudhuri wrote:
>> I have a need to cluster a collection of words and phrases by syntactic
>> similarity over a distributed environment, and I came upon Mahout as a
>> possible solution. After studying the documentation though, I am finding
>> all of it tailored to working with entire documents rather than words and
>> phrases. I simply want to know if you believe that Mahout is the right
>> tool for this job. I suppose I could try to view each word and phrase as
>> individual tiny documents, but that feels like I am forcing it.
>> 
>> Any insight is appreciated.
>> 
>> Thanks.
>> 
> 

