mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Clustering a large crawl
Date Wed, 30 May 2012 21:29:28 GMT
Using:

  * A random crawl starting from mostly news sites.
  * TFIDF
  * a custom lucene analyzer with stemming, stop words, lowercasing,
    number removal
  * tried bi-grams but didn't like the results; tried a large mll of 2000
    but still got too many meaningless ones. Will revisit, but I don't
    think it's all that important.
  * L2 normalization

I know it has clusters because when I use a random k value I get a lot 
of good clusters and quite a few bad ones.
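
For readers who want to see what the TFIDF and L2 normalization steps above amount to, here is a minimal sketch in plain Python. It is not Mahout's implementation: the smoothed IDF formula and the names `tfidf_l2` and `cosine_distance` are assumptions made for illustration, and the input is assumed to be already tokenized (stemmed, stop words removed). It shows that once vectors have unit L2 length, cosine distance reduces to 1 minus the dot product.

```python
import math
from collections import Counter


def tfidf_l2(docs):
    """docs: list of token lists -> list of {term: weight} dicts with unit L2 norm."""
    n = len(docs)
    # Document frequency: number of docs in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption; Mahout's exact weighting may differ).
        vec = {t: c * (math.log((n + 1) / (df[t] + 1)) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # L2 normalization: scale every vector to length 1 (empty docs stay empty).
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine_distance(a, b):
    """Cosine distance for unit-length sparse vectors: 1 - dot(a, b)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())
```

Documents that share no terms end up at distance 1.0 under this measure, regardless of their length.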


On 5/30/12 10:39 AM, Robert Stewart wrote:
> What type of documents (news, web, pdf, etc.)?  How are your vectors
> constructed?  Are you using TF*IDF on just uni-terms or n-grams, etc?
> Do you stop or stem the content?  How do you know the data contains the
> type of clusters you expect mahout to find?
>
> Some suggestions:
> Make sure you remove stop words and use stemming on remaining terms
> Try using bi-grams instead of single terms
> Try building clusters from document headlines only (in the case of news articles)
>
> On May 30, 2012, at 1:23 PM, Pat Ferrel wrote:
>
>> I have about 150,000 docs on which I ran canopy with values for
>> t1 = t2 from 0.1 to 0.95 using the Cosine distance measure. I got
>> results that range from 1.5 to 3 docs per cluster. In other words,
>> canopy produced a very large number of centroids, which does not seem
>> to represent the data very well. Trying random values for k seems to
>> produce better results, but they are still spotty and hard to judge.
>> I am at the point of giving up on canopy, and so wrote a utility to
>> simply iterate k over some values and run the evaluators each time,
>> but there are currently some problems with CDbw (Inter-Cluster Density
>> is always 0.0, for instance).
>>
>> This seems like such a fundamental problem that others must have found
>> a way to get better results. Any suggestions?
>
>
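
To make the t1 = t2 canopy setup discussed in this thread concrete, here is a minimal sketch of the textbook canopy procedure in plain Python. It is not Mahout's implementation, which may differ in detail; the names `canopy` and `cosine_distance` and the sparse-dict vector format are assumptions made for illustration. With t1 = t2, every point within the threshold of a center is both assigned to that canopy and removed from further consideration, so the number of canopies depends entirely on how many pairwise distances fall under the threshold.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (na * nb)


def canopy(vectors, t1, t2, distance=cosine_distance):
    """Textbook canopy clustering (requires t1 >= t2).

    Returns a list of (center_index, member_indices) pairs.
    """
    if t1 < t2:
        raise ValueError("t1 must be >= t2")
    remaining = list(range(len(vectors)))
    canopies = []
    while remaining:
        center = remaining.pop(0)  # next unclaimed point becomes a center
        members = [center]
        still_remaining = []
        for i in remaining:
            d = distance(vectors[center], vectors[i])
            if d <= t1:
                members.append(i)  # loosely close: joins this canopy
            if d > t2:
                still_remaining.append(i)  # not tightly close: may seed a later canopy
        remaining = still_remaining
        canopies.append((center, members))
    return canopies
```

Running `len(canopy(vectors, t, t))` over a range of `t` on a sample of the document vectors is one way to see how quickly the canopy count changes with the threshold before committing to a full run.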
