mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Clustering a large crawl
Date Sun, 03 Jun 2012 07:34:06 GMT
Also, removing stop words usually only makes the vocabulary minutely smaller.

What really makes a vocabulary smaller is eliminating hapax legomena
<http://en.wikipedia.org/wiki/Hapax_legomenon>.
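
A minimal sketch of that pruning step, assuming the corpus term counts are already collected into a map (the class and method names here are made up for illustration, this is not part of the Mahout pipeline):

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class HapaxFilter {
      // Keep every term that occurs more than once in the corpus;
      // hapax legomena (corpus count == 1) are dropped from the dictionary.
      public static Set<String> dropHapaxes(Map<String, Long> termCounts) {
        Set<String> kept = new HashSet<String>();
        for (Map.Entry<String, Long> e : termCounts.entrySet()) {
          if (e.getValue() > 1L) {
            kept.add(e.getKey());
          }
        }
        return kept;
      }
    }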

On Sun, Jun 3, 2012 at 9:23 AM, Lance Norskog <goksron@gmail.com> wrote:

> "100,000 terms even with n-grams, "...
>
> Ummmm... N-grams will make it bigger, not smaller :)
>
> I haven't studied the text workflows lately. Is there a place where you get
> counts for all words? If so, you can just pick the smallest N counts and
> make a stopword list out of them. This would be a highly valued addition to
> the workflows.
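
One rough way to sketch that idea, assuming the per-term corpus counts are already in hand (nothing here is an existing Mahout class; it is just the idea in plain Java):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    public class RareTermStopwords {
      // Sort terms by ascending corpus count and take the n rarest as stopwords.
      public static List<String> rarestN(Map<String, Long> termCounts, int n) {
        List<Map.Entry<String, Long>> entries =
            new ArrayList<Map.Entry<String, Long>>(termCounts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Long>>() {
          public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
            return a.getValue().compareTo(b.getValue());
          }
        });
        List<String> stopwords = new ArrayList<String>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
          stopwords.add(entries.get(i).getKey());
        }
        return stopwords;
      }
    }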
>
> On Fri, Jun 1, 2012 at 9:36 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > I am pretty sure that Suneel meant keep the top 1000 terms per document.
> >
> >
> > On Fri, Jun 1, 2012 at 2:21 AM, Pat Ferrel <pat@occamsmachete.com>
> wrote:
> >
> >>  Are you saying that
> >>   1. you threw out all but the top 1000 terms per document by weight? or
> >>   2. your dictionary has only 1000 terms in it and you threw all others
> >> away?
> >>
> >> The latter is a simple dimensional reduction trick to try, but 1000 seems
> >> low to me for the entire dictionary.
> >>
> >> A question for you about similarity. I wonder if using all terms is
> >> better for the similarity measure? What is noise in clustering may be
> >> important when looking at cooccurrences. What do you think?
> >>
> >>
> >> On 5/31/12 4:20 PM, Suneel Marthi wrote:
> >>
> >>  Pat,
> >>
> >>  We have been trying to do something very similar to what you are trying
> >> to accomplish and we ended up with better clusters by considering only
> >> the top 1000 terms (by tf-idf weight) per doc and using Tanimoto distance.
> >>
> >>  Definitely give dimensionality reduction a try and let us know how it
> >> works out.
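
If it helps, the per-document pruning can be sketched roughly like this, assuming the Mahout 0.x Vector API (iterateNonZero, RandomAccessSparseVector); as far as I know there is no built-in option for this, so it would run as a post-processing pass over the tf-idf vectors:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class TopTermsPruner {
      // Return a copy of v that keeps only its k largest tf-idf weights.
      public static Vector keepTopK(Vector v, int k) {
        List<double[]> elems = new ArrayList<double[]>();
        for (Iterator<Vector.Element> it = v.iterateNonZero(); it.hasNext();) {
          Vector.Element e = it.next();
          elems.add(new double[] { e.index(), e.get() });
        }
        Collections.sort(elems, new Comparator<double[]>() {
          public int compare(double[] a, double[] b) {
            return Double.compare(b[1], a[1]);   // descending by weight
          }
        });
        Vector pruned = new RandomAccessSparseVector(v.size());
        for (int i = 0; i < Math.min(k, elems.size()); i++) {
          pruned.setQuick((int) elems.get(i)[0], elems.get(i)[1]);
        }
        return pruned;
      }
    }

Calling keepTopK(docVector, 1000) on each document leaves the dictionary alone but makes every vector much sparser, which is presumably where the better clusters came from.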
> >>
> >>    ------------------------------
> >> *From:* Pat Ferrel <pat@occamsmachete.com>
> >> *To:* user@mahout.apache.org
> >> *Sent:* Thursday, May 31, 2012 6:42 PM
> >> *Subject:* Re: Clustering a large crawl
> >>
> >>  Yeah, that's the conclusion I was coming to but thought I'd ask the
> >> experts. My dictionary is pretty big. The last time I looked it was over
> >> 100,000 terms even with n-grams, Lucene stop words, no numbers, and
> >> stemming. I've tried Tanimoto too with similar results.
> >>
> >> Dimensional reduction seems like the next thing to try.
> >>
> >> -Pat
> >>
> >>
> >> Further data from 150,000 docs. Using Canopy clustering I get these
> >> values:
> >>     t1 = t2 = 0.3 => 123094 canopies
> >>     t1 = t2 = 0.6 => 97035 canopies
> >>     t1 = t2 = 0.9 => 60160 canopies
> >>     t1 = t2 = 0.91 => 59491 canopies
> >>     t1 = t2 = 0.93 => 58526 canopies
> >>     t1 = t2 = 0.95 => 57854 canopies
> >>     t1 = t2 = 0.97 => 57244 canopies
> >>     t1 = t2 = 0.99 => 56241 canopies
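
To see why larger t means fewer canopies, here is a toy in-memory version of the canopy pass with t1 == t2 == t (only an illustration, not the Mahout MapReduce job): every point within t of an existing center is consumed by that canopy, so as t grows each canopy swallows more points and fewer centers survive.

    import java.util.ArrayList;
    import java.util.List;

    public class ToyCanopy {
      // Cosine distance: 1 - cos(angle), so 0 = same direction, 1 = orthogonal.
      static double cosineDistance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          na += a[i] * a[i];
          nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
      }

      // One pass over the points with a single threshold t (t1 == t2).
      static List<double[]> canopyCenters(List<double[]> points, double t) {
        List<double[]> centers = new ArrayList<double[]>();
        for (double[] p : points) {
          boolean covered = false;
          for (double[] c : centers) {
            if (cosineDistance(p, c) < t) {   // within t2: consumed by this canopy
              covered = true;
              break;
            }
          }
          if (!covered) {
            centers.add(p);   // otherwise it starts a new canopy
          }
        }
        return centers;
      }
    }

With 150,000 mostly dissimilar news documents, even t = 0.99 leaves roughly 56,000 points that are not within t of any earlier center, which matches the counts above.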
> >>
> >>
> >>
> >> On 5/31/12 2:31 PM, Jeff Eastman wrote:
> >>
> >> And I misconstrued your earlier remarks on cluster size vs number of
> >> clusters. As t -> 1 you will get fewer and fewer canopies as you have
> >> observed. It actually doesn't seem like the cosine distance measure is
> >> working very well for you.
> >>
> >> Have you mentioned the size of your dictionary earlier? Perhaps
> >> increasing the number of stop words that are rejected will decrease the
> >> vector size and make clustering work better. This seems like the curse
> >> of dimensionality at work.
> >>
> >> On 5/31/12 11:18 AM, Pat Ferrel wrote:
> >>
> >> Oops, misspoke: 0 is good, 1 is bad, for clustering at least.
> >> For similarity, 1 is good and 0 is bad.
> >>
> >> One is a similarity value and the other a distance measure.
> >>
> >> But the primary question is how to get better canopies. I would expect
> >> that as the distance t gets small the number of canopies gets large,
> >> which is what I see in the data below. Jeff suggests I try much smaller
> >> t to get fewer canopies and I will, though I don't understand the logic.
> >> The docs are not all that similar, being from a general news crawl.
> >>
> >> When using the CosineDistanceMeasure in Canopy on a corpus of 150,000
> >> docs I get:
> >>     t1 = t2 = 0.3 => 123094 canopies
> >>     t1 = t2 = 0.6 => 97035 canopies
> >>     t1 = t2 = 0.9 => 60160 canopies
> >>
> >> Obviously none of these values for t is very useful and it looks like I
> >> need to make t even larger, which would seem to indicate very
> >> loose/non-dense canopies, no? For very large values of t, are the canopies useful?
> >>
> >> I'm trying both but the other odd thing is that it takes longer to run
> >> canopy on this data than to run kmeans, a lot longer.
> >>
> >> On 5/31/12 12:44 AM, Sean Owen wrote:
> >>
> >> On Thu, May 31, 2012 at 12:36 AM, Pat Ferrel <pat@occamsmachete.com>
> >> wrote:
> >>
> >> I see
> >>     double denominator = Math.sqrt(lengthSquaredp1) *
> >> Math.sqrt(lengthSquaredp2);
> >>     // correct for floating-point rounding errors
> >>     if (denominator < dotProduct) {
> >>       denominator = dotProduct;
> >>     }
> >>     return 1.0 - dotProduct / denominator;
> >>
> >> So this is going to return 1 - cosine, right? So for clustering the
> >> distance 1 = very close, 0 = very far.
> >>
> >>
> >>  When two vectors are close, the angle between them is small, so the
> >> cosine is large, near 1. 0 = close, 1 = far, as expected.
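
A quick hand-worked pair of examples of that snippet's convention, using made-up two-term vectors:

    p1 = (1, 0), p2 = (1, 0):  dotProduct = 1, denominator = 1  =>  1.0 - 1/1 = 0.0  (identical, close)
    p1 = (1, 0), p2 = (0, 1):  dotProduct = 0, denominator = 1  =>  1.0 - 0/1 = 1.0  (orthogonal, far)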
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
