Also, removing stop words usually only makes it minutely smaller.
What really makes a vocabulary smaller is eliminating
hapax legomena <http://en.wikipedia.org/wiki/Hapax_legomenon>.
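
As a rough sketch of hapax removal, assuming the documents are already tokenized into lists of terms: count every term across the corpus and keep only terms seen at least twice. The class name HapaxFilter and its methods are made up for illustration; they are not part of Mahout or Lucene.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of Mahout or Lucene.
public class HapaxFilter {

    // Count how often each term occurs across the whole corpus.
    public static Map<String, Integer> countTerms(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Keep only terms seen at least minCount times in the corpus.
    // minCount = 2 drops the hapax legomena (terms seen exactly once).
    public static Set<String> pruneVocabulary(List<List<String>> docs, int minCount) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : countTerms(docs).entrySet()) {
            if (e.getValue() >= minCount) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy corpus: "zyzzyva" occurs once, so it is pruned.
        List<List<String>> docs = List.of(
            List.of("canopy", "cluster", "canopy"),
            List.of("cluster", "zyzzyva"));
        System.out.println(pruneVocabulary(docs, 2));
    }
}
```

Raising minCount above 2 prunes more aggressively; on a real crawl the counting step would be done as a distributed job rather than in memory.
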
On Sun, Jun 3, 2012 at 9:23 AM, Lance Norskog <goksron@gmail.com> wrote:
> "100,000 terms even with ngrams, "...
>
> Ummmm... Ngrams will make it bigger, not smaller :)
>
> I haven't studied the text workflows lately. Is there a place where you get
> counts for all words? If so, you can just pick the smallest N counts and
> make a stopword list out of them. This would be a highly valued addition to
> the workflows.
>
> On Fri, Jun 1, 2012 at 9:36 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > I am pretty sure that Suneel meant keep the top 1000 terms per document.
> >
> >
> > On Fri, Jun 1, 2012 at 2:21 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
> >
> >> Are you saying that
> >> 1. you threw out all but the top 1000 terms per document by weight? or
> >> 2. your dictionary has only 1000 terms in it and you threw all others
> >> away?
> >>
> >> The latter is a simple dimensional reduction trick to try, but 1000 seems
> >> low to me for the entire dictionary.
> >>
> >> A question for you about similarity. I wonder if using all terms is
> >> better for the similarity measure? What is noise in clustering may be
> >> important when looking at cooccurrences. What do you think?
> >>
> >>
> >> On 5/31/12 4:20 PM, Suneel Marthi wrote:
> >>
> >> Pat,
> >>
> >> We have been trying to do something very similar to what you are trying to
> >> accomplish, and we ended up with better clusters by considering only the
> >> top 1000 terms (by tfidf weight) per doc and using Tanimoto distance.
> >>
> >> Definitely give dimensionality reduction a try and let us know how it
> >> works out.
> >>
> >> 
> >> *From:* Pat Ferrel <pat@occamsmachete.com>
> >> *To:* user@mahout.apache.org
> >> *Sent:* Thursday, May 31, 2012 6:42 PM
> >> *Subject:* Re: Clustering a large crawl
> >>
> >> Yeah, that's the conclusion I was coming to but thought I'd ask the
> >> experts. My dictionary is pretty big. The last time I looked it was over
> >> 100,000 terms even with ngrams, Lucene stop words, no numbers, and
> >> stemming. I've tried Tanimoto too with similar results.
> >>
> >> Dimensional reduction seems like the next thing to try.
> >>
> >> Pat
> >>
> >>
> >> Further data from 150,000 docs. Using Canopy clustering I get these values:
> >> t1 = t2 = 0.3 => 123094 canopies
> >> t1 = t2 = 0.6 => 97035 canopies
> >> t1 = t2 = 0.9 => 60160 canopies
> >> t1 = t2 = 0.91 => 59491 canopies
> >> t1 = t2 = 0.93 => 58526 canopies
> >> t1 = t2 = 0.95 => 57854 canopies
> >> t1 = t2 = 0.97 => 57244 canopies
> >> t1 = t2 = 0.99 => 56241 canopies
> >>
> >>
> >>
> >> On 5/31/12 2:31 PM, Jeff Eastman wrote:
> >>
> >> And I misconstrued your earlier remarks on cluster size vs number of
> >> clusters. As t -> 1 you will get fewer and fewer canopies, as you have
> >> observed. It actually doesn't seem like the cosine distance measure is
> >> working very well for you.
> >>
> >> Have you mentioned the size of your dictionary earlier? Perhaps
> >> increasing the number of stop words that are rejected will decrease the
> >> vector size and make clustering work better. This seems like the curse of
> >> dimensionality at work.
> >>
> >> On 5/31/12 11:18 AM, Pat Ferrel wrote:
> >>
> >> Oops, misspoke. 0 good, 1 bad for clustering at least
> >> For similarity 1 good 0 bad.
> >>
> >> One is a similarity value and the other a distance measure.
> >>
> >> But the primary question is how to get better canopies. I would expect
> >> that as the distance t gets small the number of canopies gets large, which
> >> is what I see in the data below. Jeff suggests I try much smaller t to get
> >> fewer canopies, and I will, though I don't understand the logic. The docs are
> >> not all that similar, being from a general news crawl.
> >>
> >> When using the CosineDistanceMeasure in Canopy on a corpus of 150,000
> >> docs I get:
> >> t1 = t2 = 0.3 => 123094 canopies
> >> t1 = t2 = 0.6 => 97035 canopies
> >> t1 = t2 = 0.9 => 60160 canopies
> >>
> >> Obviously none of these values for t is very useful and it looks like I
> >> need to make t even larger, which would seem to indicate very
> >> loose/non-dense canopies, no? For very large t, are the canopies useful?
> >>
> >> I'm trying both but the other odd thing is that it takes longer to run
> >> canopy on this data than to run kmeans, a lot longer.
> >>
> >> On 5/31/12 12:44 AM, Sean Owen wrote:
> >>
> >> On Thu, May 31, 2012 at 12:36 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
> >>
> >> I see
> >> double denominator = Math.sqrt(lengthSquaredp1) * Math.sqrt(lengthSquaredp2);
> >> // correct for floating-point rounding errors
> >> if (denominator < dotProduct) {
> >>   denominator = dotProduct;
> >> }
> >> return 1.0 - dotProduct / denominator;
> >>
> >> So this is going to return 1 - cosine, right? So for clustering the
> >> distance 1 = very close, 0 = very far.
> >>
> >>
> >> When two vectors are close, the angle between them is small, so the cosine
> >> is large, near 1, and the distance 1 - cosine is near 0. 0 = close, 1 = far,
> >> as expected.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
>
>
> 
> Lance Norskog
> goksron@gmail.com
>
