mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Clustering a large crawl
Date Fri, 01 Jun 2012 16:36:41 GMT
I am pretty sure that Suneel meant keep the top 1000 terms per document.
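
A minimal sketch of that per-document pruning in plain Java (the Map-based
document representation and the topKByWeight helper are illustrative
stand-ins, not Mahout's vector API):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TopKTerms {
      // Keep only the k highest-weighted terms of one document vector,
      // modeled here as a map from term id to tf-idf weight.
      static Map<Integer, Double> topKByWeight(Map<Integer, Double> doc, int k) {
        List<Map.Entry<Integer, Double>> entries = new ArrayList<>(doc.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<Integer, Double>>() {
          @Override
          public int compare(Map.Entry<Integer, Double> a, Map.Entry<Integer, Double> b) {
            return Double.compare(b.getValue(), a.getValue()); // descending by weight
          }
        });
        Map<Integer, Double> pruned = new HashMap<>();
        for (Map.Entry<Integer, Double> e : entries.subList(0, Math.min(k, entries.size()))) {
          pruned.put(e.getKey(), e.getValue());
        }
        return pruned;
      }
    }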

On Fri, Jun 1, 2012 at 2:21 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

>  Are you saying that
>   1. you threw out all but the top 1000 terms per document by weight? or
>   2. your dictionary has only 1000 terms in it and you threw all others
> away?
>
> The latter is a simple dimensionality-reduction trick to try, but 1000
> seems low to me for the entire dictionary.
>
> A question for you about similarity: I wonder whether using all terms is
> better for the similarity measure. What is noise in clustering may be
> important when looking at co-occurrences. What do you think?
>
>
> On 5/31/12 4:20 PM, Suneel Marthi wrote:
>
>  Pat,
>
>  We have been trying to do something very similar to what you are trying
> to accomplish, and we ended up with better clusters by considering only
> the top 1000 terms (by tf-idf weight) per doc and using Tanimoto distance.
>
>  Definitely give dimensionality reduction a try and let us know how it
> works out.
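
For readers unfamiliar with it, the Tanimoto (extended Jaccard) distance over
weighted vectors is 1 - dot / (|a|^2 + |b|^2 - dot). A minimal sketch in plain
Java (dense double[] arrays and the names are illustrative; Mahout's
TanimotoDistanceMeasure operates on its own sparse Vector type):

    public class Tanimoto {
      // d(a, b) = 1 - a.b / (|a|^2 + |b|^2 - a.b); 0 = identical, 1 = disjoint
      static double tanimotoDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          normA += a[i] * a[i];
          normB += b[i] * b[i];
        }
        double denominator = normA + normB - dot;
        // two all-zero vectors: treat them as identical (distance 0)
        return denominator == 0.0 ? 0.0 : 1.0 - dot / denominator;
      }
    }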
>
>    ------------------------------
> *From:* Pat Ferrel <pat@occamsmachete.com>
> *To:* user@mahout.apache.org
> *Sent:* Thursday, May 31, 2012 6:42 PM
> *Subject:* Re: Clustering a large crawl
>
>  Yeah, that's the conclusion I was coming to, but I thought I'd ask the
> experts. My dictionary is pretty big: the last time I looked it was over
> 100,000 terms, even with n-grams, Lucene stop words, no numbers, and
> stemming. I've tried Tanimoto too, with similar results.
>
> Dimensional reduction seems like the next thing to try.
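
One lightweight way to experiment with dimensionality reduction, independent
of Mahout's own jobs, is a random projection; a minimal sketch under that
assumption (the projectRow helper and its dimensions are illustrative, not
Mahout's implementation):

    import java.util.Random;

    public class RandomProjection {
      // Project a high-dimensional term vector down to targetDim dimensions
      // with a seeded Gaussian random matrix, generated column by column so
      // the dense matrix is never materialized.
      static double[] projectRow(double[] row, int targetDim, long seed) {
        double[] out = new double[targetDim];
        for (int j = 0; j < targetDim; j++) {
          Random r = new Random(seed + j);     // reproducible column j
          double sum = 0.0;
          for (int i = 0; i < row.length; i++) {
            sum += row[i] * r.nextGaussian(); // row . randomColumn_j
          }
          out[j] = sum / Math.sqrt(targetDim); // scaling roughly preserves distances
        }
        return out;
      }
    }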
>
> -Pat
>
>
> Further data from 150,000 docs. Using Canopy clustering I get these values:
>     t1 = t2 = 0.3 => 123094 canopies
>     t1 = t2 = 0.6 => 97035 canopies
>     t1 = t2 = 0.9 => 60160 canopies
>     t1 = t2 = 0.91 => 59491 canopies
>     t1 = t2 = 0.93 => 58526 canopies
>     t1 = t2 = 0.95 => 57854 canopies
>     t1 = t2 = 0.97 => 57244 canopies
>     t1 = t2 = 0.99 => 56241 canopies
>
>
>
> On 5/31/12 2:31 PM, Jeff Eastman wrote:
>
> And I misconstrued your earlier remarks on cluster size vs number of
> clusters. As t -> 1 you will get fewer and fewer canopies as you have
> observed. It actually doesn't seem like the cosine distance measure is
> working very well for you.
>
> Have you mentioned the size of your dictionary earlier? Perhaps increasing
> the number of stop words that are rejected will decrease the vector size
> and make clustering work better. This seems like the curse of
> dimensionality at work.
>
> On 5/31/12 11:18 AM, Pat Ferrel wrote:
>
> Oops, misspoke: 0 = good, 1 = bad, for clustering at least.
> For similarity, 1 = good, 0 = bad.
>
> One is a similarity value and the other is a distance measure.
>
> But the primary question is how to get better canopies. I would expect
> that as the distance t gets small the number of canopies gets large, which
> is what I see in the data below. Jeff suggests I try a much smaller t to
> get fewer canopies, and I will, though I don't understand the logic. The
> docs are not all that similar, being from a general news crawl.
>
> When using the CosineDistanceMeasure in Canopy on a corpus of 150,000 docs
> I get:
>     t1 = t2 = 0.3 => 123094 canopies
>     t1 = t2 = 0.6 => 97035 canopies
>     t1 = t2 = 0.9 => 60160 canopies
>
> Obviously none of these values for t is very useful, and it looks like I
> need to make t even larger, which would seem to indicate very
> loose/non-dense canopies, no? For very large values of t, are the canopies
> useful?
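
To make the threshold behavior concrete, here is a stripped-down,
single-machine sketch of canopy formation (illustrative only; Mahout's
CanopyClusterer also handles the looser t1 membership threshold and runs as a
Hadoop job). With a distance bounded by 1, a larger t2 lets each new center
absorb more points, leaving fewer behind to seed further canopies:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class CanopySketch {
      interface Distance { double between(double[] a, double[] b); } // stand-in

      // With t1 = t2, membership and absorption coincide: every point within
      // t2 of a new center is removed and can never seed a canopy of its own.
      static List<double[]> canopyCenters(List<double[]> points, double t2, Distance d) {
        List<double[]> centers = new ArrayList<>();
        List<double[]> remaining = new ArrayList<>(points);
        while (!remaining.isEmpty()) {
          double[] center = remaining.remove(0);  // next free point seeds a canopy
          centers.add(center);
          for (Iterator<double[]> it = remaining.iterator(); it.hasNext(); ) {
            if (d.between(center, it.next()) < t2) {
              it.remove();                        // absorbed into this canopy
            }
          }
        }
        return centers;
      }
    }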
>
> I'm trying both, but the other odd thing is that it takes longer to run
> canopy on this data than to run kmeans, a lot longer.
>
> On 5/31/12 12:44 AM, Sean Owen wrote:
>
> On Thu, May 31, 2012 at 12:36 AM, Pat Ferrel <pat@occamsmachete.com>
> wrote:
>
> I see
>     double denominator = Math.sqrt(lengthSquaredp1) * Math.sqrt(lengthSquaredp2);
>     // correct for floating-point rounding errors
>     if (denominator < dotProduct) {
>       denominator = dotProduct;
>     }
>     return 1.0 - dotProduct / denominator;
>
> So this is going to return 1 - cosine, right? So for clustering the
> distance 1 = very close, 0 = very far.
>
>
>  When two vectors are close, the angle between them is small, so the
> cosine is large, near 1, and the returned distance 1 - cosine is near 0.
> So 0 = close, 1 = far, as expected.
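
A quick worked example with the snippet above: for a = (1, 0) and b = (1, 1)
the dot product is 1 and the denominator is 1 * sqrt(2), so the distance is
1 - 1/sqrt(2), about 0.29 (fairly close); for orthogonal vectors the dot
product is 0 and the distance is 1 (maximally far).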