mahout-user mailing list archives

From dipesh <dipshres...@gmail.com>
Subject Re: Text clustering
Date Sat, 06 Dec 2008 04:54:05 GMT
Hi Philippe,

I'm also doing some work on text clustering with feature extraction. For
text clustering, cosine distance is generally considered a better similarity
metric than the Euclidean distance measure. I couldn't find a
CosineDistanceMeasure in Mahout; did you use a cosine distance measure in
your clustering project?
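
For reference, here is a minimal sketch of what a cosine distance between two sparse term-weight vectors could look like. It is plain Python rather than Mahout's Java API, and the function names and the dict-based vector layout are assumptions made up for illustration, not anything that exists in Mahout.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle) between two sparse vectors given as {term: weight} dicts."""
    # Iterate over the smaller vector; a term missing from either side contributes zero.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention assumed here: an all-zero vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Plain Euclidean distance over the union of terms, for comparison."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

short_doc = {"enron": 1.0, "gas": 2.0}
long_doc = {"enron": 10.0, "gas": 20.0}  # same term proportions, ten times the weight

print(cosine_distance(short_doc, long_doc))     # very close to zero
print(euclidean_distance(short_doc, long_doc))  # large, driven by document length
```

The two toy documents use the same terms in the same proportions, so the cosine distance treats them as nearly identical while the Euclidean distance is dominated by the difference in length. That is the usual argument for cosine distance on text.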

regards,
Dipesh

On Fri, Dec 5, 2008 at 11:45 PM, Philippe Lamarche <philippe.lamarche@gmail.com> wrote:

> I will try to do the same.
>
> On Fri, Dec 5, 2008 at 8:40 AM, Grant Ingersoll <gsingers@apache.org>
> wrote:
>
> >
> > On Dec 5, 2008, at 6:05 AM, Richard Tomsett wrote:
> >
> >> Sure :-) I haven't got my project on me at the moment but should be
> >> able to get at it some time before Xmas so will look through it again
> >> and send you anything that may be useful.
> >>
> >
> > Cool, just add a patch to JIRA, if you can.  I think we could work
> > together to create a Text Clustering "example".
> >
> >
> >
> >
> >>
> >>
> >> 2008/12/5 Grant Ingersoll <gsingers@apache.org>
> >>
> >>> I seem to recall some discussion a while back about being able to add
> >>> labels to the vectors/matrices, but I don't know the status of the
> >>> patch.
> >>>
> >>> At any rate, very cool that you are using it for text clustering.  I
> >>> still have on my list to write up how to do this and to write some
> >>> supporting code as well.  So, if either of you cares to contribute,
> >>> that would be most useful.
> >>>
> >>> -Grant
> >>>
> >>>
> >>> On Dec 3, 2008, at 6:46 PM, Richard Tomsett wrote:
> >>>
> >>>> Hi Philippe,
> >>>>
> >>>> I used the K-Means on TF-IDF vectors and wondered the same thing -
> >>>> about labelling the documents. I haven't got my code on me at the
> >>>> moment and it was a few months ago that I last looked at it (so I was
> >>>> also probably using an older version of Mahout)... but I seem to
> >>>> remember that I did just as you are suggesting and simply attached a
> >>>> unique ID to each document which got passed through the map-reduce
> >>>> stages. This requires a bit of tinkering with the K-Means
> >>>> implementation but shouldn't be too much work.
> >>>>
> >>>> As for having massive vectors, you could try representing them as
> >>>> sparse vectors rather than the dense vectors the standard Mahout
> >>>> K-Means algorithm accepts, which gets rid of all the zero values in
> >>>> the document vectors. See the Javadoc for details, it'll be more
> >>>> reliable than my memory :-)
> >>>>
> >>>> Richard
> >>>>
> >>>>
> >>>> 2008/12/3 Philippe Lamarche <philippe.lamarche@gmail.com>
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I have a question concerning text clustering and the current
> >>>>> K-Means/vectors implementation.
> >>>>>
> >>>>> For a school project, I did some text clustering with a subset of the
> >>>>> Enron corpus. I implemented a small M/R package that transforms text
> >>>>> into TF-IDF vector space, and then I used a slightly modified version
> >>>>> of the syntheticcontrol K-Means example. So far, all is fine.
> >>>>>
> >>>>> However, the output of the k-means algorithm is a vector, as is the
> >>>>> input. As I understand it, when text is transformed into vector space,
> >>>>> the cardinality of the vector is the number of words in your global
> >>>>> dictionary, that is, all the words in all the texts being clustered.
> >>>>> This can grow pretty quickly. For example, with only 27000 Enron
> >>>>> emails, even when removing words that appear in 2 emails or fewer,
> >>>>> the dictionary size is about 45000 words.
> >>>>>
> >>>>> My number one problem is this: how can we find out what document a
> >>>>> vector is representing when it comes out of the k-means algorithm? My
> >>>>> favorite solution would be to have a unique ID attached to each vector.
> >>>>> Is there such an ID in the vector implementation? Is there a better
> >>>>> solution? Is my approach to text clustering wrong?
> >>>>>
> >>>>> Thanks for the help,
> >>>>>
> >>>>> Philippe.
> >>>>>
> >>>>>
> >>> --------------------------
> >>> Grant Ingersoll
> >>>
> >>> Lucene Helpful Hints:
> >>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >>> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>>
> > --------------------------
> > Grant Ingersoll
> >
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ
>
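On the labelling question in the quoted thread: Richard's idea of attaching a unique ID to each document and passing it through the map-reduce stages can be sketched roughly as below. This is plain Python, not Mahout's K-Means code; the `(doc_id, vector)` pairing, the function names, and the toy data are all assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_with_ids(labelled_vectors, centroids):
    """Map-like step: group (doc_id, vector) pairs under the index of the nearest centroid."""
    assignments = {}
    for doc_id, vector in labelled_vectors:
        nearest = min(range(len(centroids)),
                      key=lambda k: squared_distance(vector, centroids[k]))
        # The ID travels with the vector, so it is still there after clustering.
        assignments.setdefault(nearest, []).append((doc_id, vector))
    return assignments

def recompute_centroids(assignments, centroids):
    """Reduce-like step: average each cluster's vectors; empty clusters keep their old centroid."""
    updated = list(centroids)
    for k, members in assignments.items():
        dims = len(members[0][1])
        updated[k] = [sum(v[d] for _, v in members) / len(members) for d in range(dims)]
    return updated

docs = [("mail-001", [1.0, 0.0]), ("mail-002", [0.9, 0.1]), ("mail-003", [0.0, 1.0])]
centroids = [[1.0, 0.0], [0.0, 1.0]]

clusters = assign_with_ids(docs, centroids)
for k in sorted(clusters):
    print(k, [doc_id for doc_id, _ in clusters[k]])
```

The centroid update only looks at the vectors, so carrying the ID does not change the clustering itself; it only makes the output traceable back to documents.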

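Richard's other point in the quoted thread, about sparse versus dense vectors, can be illustrated the same way. The sketch below is plain Python with a hypothetical `{index: value}` representation; it does not use Mahout's vector classes, and the 45000 cardinality is simply the dictionary size Philippe mentions.

```python
def to_sparse(dense):
    """Keep only the non-zero entries of a dense vector as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}

def sparse_dot(a, b):
    """Dot product that only visits indices present in the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

cardinality = 45000            # dictionary size mentioned in the thread
dense = [0.0] * cardinality    # one slot per dictionary word, almost all zero
dense[12] = 0.5
dense[40321] = 1.5

sparse = to_sparse(dense)
print(len(dense), len(sparse))  # stored entries: dense versus sparse
```

A single email only contains a small fraction of the dictionary, so the sparse form stores a handful of entries where the dense form stores one per dictionary word.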

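Finally, the dictionary pruning Philippe describes in the quoted thread (dropping words that appear in 2 emails or fewer before building TF-IDF vectors) might look roughly like this. The thread does not say which TF-IDF formula was used, so the weighting here (raw term count times `log(N / df)`) is an assumption, as are the function name, the `min_df` parameter, and the toy emails.

```python
import math
from collections import Counter

def build_tfidf(docs, min_df=3):
    """docs: {doc_id: list of tokens}. Returns (dictionary, {doc_id: {term_index: weight}})."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    # Keep only terms that appear in at least min_df documents.
    kept = sorted(t for t, n in df.items() if n >= min_df)
    dictionary = {term: i for i, term in enumerate(kept)}
    n_docs = len(docs)
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(t for t in tokens if t in dictionary)
        # Assumed weighting: raw term frequency times log(N / document frequency).
        vectors[doc_id] = {dictionary[t]: c * math.log(n_docs / df[t])
                           for t, c in tf.items()}
    return dictionary, vectors

docs = {
    "mail-001": "gas price meeting".split(),
    "mail-002": "gas price report".split(),
    "mail-003": "gas meeting lunch".split(),
    "mail-004": "price meeting gas gas".split(),
}
dictionary, vectors = build_tfidf(docs, min_df=3)
print(dictionary)            # surviving terms and their vector indices
print(vectors["mail-004"])   # sparse TF-IDF vector for one email
```

The vector cardinality equals the size of the pruned dictionary, which is why the pruning threshold has a direct effect on how large the vectors get.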

-- 
----------------------------------------
"Help Ever Hurt Never"- Baba
