mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Word and Phrase Clustering
Date Fri, 02 Dec 2011 17:53:22 GMT
Here is an ancient article on the subject.

http://www.aclweb.org/anthology-new/J/J92/J92-3004.pdf

You don't need fancy computer capabilities to cluster words based on
spelling.
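
As a rough illustration of how little machinery spelling-based similarity needs, here is a minimal sketch of the character n-gram idea from Pascal's message quoted below, including the start-of-word marker trick. The function names, the "^" marker, the trigram size, and the use of Jaccard overlap are all assumptions made for this example; nothing here comes from Mahout or Google Refine.

```python
def char_ngrams(term, n=3, start_marker="^"):
    """Return the set of character n-grams of a term.

    Each word is lowercased and prefixed with start_marker, so n-grams at the
    beginning of a word differ from the same letters in the middle of a word.
    Pass start_marker="" to disable the trick.
    """
    grams = set()
    for word in term.lower().split():
        marked = start_marker + word
        if len(marked) < n:
            # Word too short for a full n-gram: keep it whole.
            grams.add(marked)
        else:
            grams.update(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# The sample list from the thread, with one duplicate added for illustration.
terms = ["Alabama", "Obama", "University of Alabama", "Bama",
         "Potomac", "Texas", "Potomac River", "Alabama"]

# Removing duplicates first shrinks the set that has to be compared.
unique_terms = sorted(set(terms))

for i, left in enumerate(unique_terms):
    for right in unique_terms[i + 1:]:
        score = jaccard(char_ngrams(left), char_ngrams(right))
        if score > 0:
            print(f"{left!r} ~ {right!r}: {score:.2f}")
```

With these settings "Potomac" and "Potomac River" share 6 of their 10 distinct trigrams. "Bama" still scores closer to "Obama" than to "Alabama", so n-grams alone will not keep Obama out of the Alabama cluster; some extra signal would be needed for that.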

On Fri, Dec 2, 2011 at 3:36 AM, Pascal Coupet <pcoupet@gmail.com> wrote:

> Hi Neil,
>
> I suggest you start by clustering on lexical affinities (based on
> how words look). From your examples, that seems to be what you are looking
> for. To cluster terms this way you don't really need to use the full
> data. You can remove all duplicates and hopefully get a much smaller set.
>
> A good way to describe terms for this usage is to use ngrams. You can also
> use phonetic transcriptions of terms. An interesting trick that works well
> is to add a special character at the beginning of each word (in the ngrams
> method). This boosts similarity between terms that start the same way,
> which is usually good.
>
> I suggest you have a look at Google
> Refine<http://code.google.com/p/google-refine/>.
> Watch the first video. It demonstrates nice term clustering capabilities
> using different methods (ngrams, ...). If that is what you are looking for,
> you can try it on the most frequent terms in your dataset, quickly get
> interesting results, and then implement the method that looks best to you.
>
> Best,
>
> Pascal
>
> 2011/12/2 Neil Chaudhuri <nchaudhuri@potomacfusion.com>
>
> > Glad to fill in more detail. Imagine I have a list of words and phrases
> > in a data store like this:
> >
> > Alabama
> > Obama
> > University of Alabama
> > Bama
> > Potomac
> > Texas
> > Potomac River
> >
> > I would like to cluster the ones that look similar enough to be the same.
> > Like "Alabama" and "University of Alabama" and "Bama" (but not Obama
> > ideally) or "Potomac" and "Potomac River."
> >
> > Now this list of words could be in the terabytes range, which is why I
> > need distributed computing capability.
> >
> > How would I assemble a Vector from an individual entry in this list? With
> > a bit more understanding of my situation, do you think Mahout can work
> > for me?
> >
> > Please let me know if I can provide more information.
> >
> > Thanks.
> >
> >
> >
> > On Dec 1, 2011, at 11:29 PM, Jeff Eastman wrote:
> >
> > > Could you elaborate a bit on what you mean by "cluster a collection of
> > > words and phrases by syntactic similarity over a distributed
> > > environment"? If you can describe your collection in terms of a set of
> > > (sparse or dense) term vectors then you should be able to use Mahout
> > > clustering directly. The vectors do not need to be huge (as "document"
> > > might imply), indeed smaller dimensionality clusterings work better
> > > than large ones. The question would be how do you plan to encode these
> > > vectors? Another would be how large a collection you have?
> > >
> > > On 12/1/11 8:08 PM, Neil Chaudhuri wrote:
> > >> I have a need to cluster a collection of words and phrases by
> > >> syntactic similarity over a distributed environment, and I came upon
> > >> Mahout as a possible solution. After studying the documentation though,
> > >> I am finding all of it tailored to working with entire documents rather
> > >> than words and phrases. I simply want to know if you believe that
> > >> Mahout is the right tool for this job. I suppose I could try to view
> > >> each word and phrase as individual tiny documents, but that feels like
> > >> I am forcing it.
> > >>
> > >> Any insight is appreciated.
> > >>
> > >> Thanks.
> > >>
> > >
> >
> >
>
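
On Neil's question above about assembling a vector from a single entry, one possible encoding (an assumption for illustration, not something proposed in the thread) is to hash each character n-gram into a fixed number of dimensions and store the counts sparsely. The sketch below uses plain Python dictionaries rather than Mahout's vector classes; the function names, the dimension count, and the choice of CRC32 as the hash are all hypothetical.

```python
import math
import zlib
from collections import Counter


def term_to_sparse_vector(term, n=3, dims=2 ** 18, start_marker="^"):
    """Encode a term as a sparse {index: count} vector of hashed n-grams.

    CRC32 is used because it is deterministic across runs and machines,
    unlike Python's built-in hash() for strings.
    """
    counts = Counter()
    for word in term.lower().split():
        marked = start_marker + word
        # Words shorter than n still contribute one (shorter) gram.
        for i in range(max(len(marked) - n + 1, 1)):
            gram = marked[i:i + n]
            index = zlib.crc32(gram.encode("utf-8")) % dims
            counts[index] += 1
    return dict(counts)


def cosine(u, v):
    """Cosine similarity between two sparse {index: weight} vectors."""
    dot = sum(weight * v.get(index, 0) for index, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


potomac = term_to_sparse_vector("Potomac")
potomac_river = term_to_sparse_vector("Potomac River")
texas = term_to_sparse_vector("Texas")

print(cosine(potomac, potomac_river))
print(cosine(potomac, texas))
```

Because every entry maps to a vector of the same fixed dimensionality, the encoding step can run independently per entry, which fits Jeff's point that the vectors themselves do not need to be huge even when the collection is.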
