mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Clustering a large crawl
Date Sun, 03 Jun 2012 07:23:44 GMT
"100,000 terms even with n-grams, "...

Ummmm... N-grams will make it bigger, not smaller :)

I haven't studied the text workflows lately. Is there a place where you can get
counts for all words? If so, you could just pick the N terms with the smallest
counts and make a stopword list out of them. This would be a highly valued
addition to the workflows.
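
One way to sketch that idea, assuming the per-term counts have already been collected into a map (the class name, map, and output file below are hypothetical, not part of the current workflows):

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    public class RareTermStopwords {

      /** Write the n terms with the smallest counts to outFile, one term per line. */
      public static void writeStopwords(Map<String, Long> termCounts, int n, String outFile)
          throws IOException {
        List<Map.Entry<String, Long>> entries =
            new ArrayList<Map.Entry<String, Long>>(termCounts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Long>>() {
          public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
            return a.getValue().compareTo(b.getValue()); // ascending: rarest terms first
          }
        });
        PrintWriter out = new PrintWriter(new FileWriter(outFile));
        try {
          for (int i = 0; i < Math.min(n, entries.size()); i++) {
            out.println(entries.get(i).getKey());
          }
        } finally {
          out.close();
        }
      }
    }

The resulting file could then be folded into whatever analyzer or stopword mechanism the vectorization workflow already uses.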

On Fri, Jun 1, 2012 at 9:36 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I am pretty sure that Suneel meant keeping the top 1000 terms per document.
>
>
> On Fri, Jun 1, 2012 at 2:21 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>
>>  Are you saying that
>>   1. you threw out all but the top 1000 terms per document by weight? or
>>   2. your dictionary has only 1000 terms in it and you threw all others
>> away?
>>
>> The latter is a simple dimensional reduction trick to try, but 1000 seems
>> low to me for the entire dictionary.
>>
>> A question for you about similarity: I wonder whether using all terms is
>> better for the similarity measure. What is noise in clustering may be
>> important when looking at cooccurrences. What do you think?
>>
>>
>> On 5/31/12 4:20 PM, Suneel Marthi wrote:
>>
>>  Pat,
>>
>>  We have been trying to do something very similar to what you are trying to
>> accomplish, and we ended up with better clusters by considering only the top
>> 1000 terms (by tf-idf weight) per doc and using Tanimoto distance.
>>
>>  Definitely give dimensionality reduction a try and let us know how it
>> works out.
>>
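
A minimal sketch of that pruning step and of the Tanimoto distance, assuming each document is held as a sparse term-index-to-weight map (the class and method names below are illustrative, not Mahout's API):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TopTermPruning {

      /** Keep only the maxTerms highest-weighted entries of a sparse tf-idf vector. */
      public static Map<Integer, Double> topTerms(Map<Integer, Double> tfidf, int maxTerms) {
        List<Map.Entry<Integer, Double>> entries =
            new ArrayList<Map.Entry<Integer, Double>>(tfidf.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<Integer, Double>>() {
          public int compare(Map.Entry<Integer, Double> a, Map.Entry<Integer, Double> b) {
            return b.getValue().compareTo(a.getValue()); // descending by weight
          }
        });
        Map<Integer, Double> pruned = new HashMap<Integer, Double>();
        for (int i = 0; i < Math.min(maxTerms, entries.size()); i++) {
          pruned.put(entries.get(i).getKey(), entries.get(i).getValue());
        }
        return pruned;
      }

      /** Tanimoto distance: 1 - a.b / (|a|^2 + |b|^2 - a.b). */
      public static double tanimotoDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
          normA += e.getValue() * e.getValue();
          Double bv = b.get(e.getKey());
          if (bv != null) {
            dot += e.getValue() * bv;
          }
        }
        for (double v : b.values()) {
          normB += v * v;
        }
        return 1.0 - dot / (normA + normB - dot);
      }
    }

Pruning to the top 1000 weights per document is the per-document variant Ted describes above; the dictionary itself stays intact.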
>>    ------------------------------
>> *From:* Pat Ferrel <pat@occamsmachete.com>
>> *To:* user@mahout.apache.org
>> *Sent:* Thursday, May 31, 2012 6:42 PM
>> *Subject:* Re: Clustering a large crawl
>>
>>  Yeah, that's the conclusion I was coming to, but thought I'd ask the
>> experts. My dictionary is pretty big: the last time I looked it was over
>> 100,000 terms even with n-grams, Lucene stop words, no numbers, and
>> stemming. I've tried Tanimoto too, with similar results.
>>
>> Dimensional reduction seems like the next thing to try.
>>
>> -Pat
>>
>>
>> Further data from 150,000 docs. Using Canopy clustering I get these values:
>>     t1 = t2 = 0.3 => 123094 canopies
>>     t1 = t2 = 0.6 => 97035 canopies
>>     t1 = t2 = 0.9 => 60160 canopies
>>     t1 = t2 = 0.91 => 59491 canopies
>>     t1 = t2 = 0.93 => 58526 canopies
>>     t1 = t2 = 0.95 => 57854 canopies
>>     t1 = t2 = 0.97 => 57244 canopies
>>     t1 = t2 = 0.99 => 56241 canopies
>>
>>
>>
>> On 5/31/12 2:31 PM, Jeff Eastman wrote:
>>
>> And I misconstrued your earlier remarks on cluster size vs number of
>> clusters. As t -> 1 you will get fewer and fewer canopies as you have
>> observed. It actually doesn't seem like the cosine distance measure is
>> working very well for you.
>>
>> Have you mentioned the size of your dictionary earlier? Perhaps
>> increasing the number of stop words that are rejected will decrease the
>> vector size and make clustering work better. This seems like the curse of
>> dimensionality at work.
>>
>> On 5/31/12 11:18 AM, Pat Ferrel wrote:
>>
>> Oops, misspoke: for clustering distance, 0 is good and 1 is bad;
>> for similarity, 1 is good and 0 is bad.
>>
>> One is a similarity value and the other is a distance measure.
>>
>> But the primary question is how to get better canopies. I would expect
>> that as the distance t gets small the number of canopies gets large, which
>> is what I see in the data below. Jeff suggests I try a much smaller t to get
>> fewer canopies and I will, though I don't understand the logic. The docs are
>> not all that similar, being from a general news crawl.
>>
>> When using the CosineDistanceMeasure in Canopy on a corpus of 150,000
>> docs I get:
>>     t1 = t2 = 0.3 => 123094 canopies
>>     t1 = t2 = 0.6 => 97035 canopies
>>     t1 = t2 = 0.9 => 60160 canopies
>>
>> Obviously none of these values for t is very useful, and it looks like I
>> need to make t even larger, which would seem to indicate very
>> loose/non-dense canopies, no? For very large values of t, are the canopies useful?
>>
>> I'm trying both, but the other odd thing is that it takes longer to run
>> Canopy on this data than to run k-means, a lot longer.
>>
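
For intuition on the t question: with t1 = t2 = t, a point starts a new canopy only when it is farther than t from every existing center, so a larger t lets each canopy cover more points and leaves fewer canopies overall. A simplified single-pass sketch of that core loop (not Mahout's CanopyDriver, just the idea):

    import java.util.ArrayList;
    import java.util.List;

    public class CanopyIntuition {

      /** Distance function stub; in this thread it is cosine or Tanimoto distance. */
      interface Distance {
        double between(double[] a, double[] b);
      }

      /** With t1 == t2 == t, count how many canopy centers a single pass produces. */
      public static int countCanopies(List<double[]> points, double t, Distance d) {
        List<double[]> centers = new ArrayList<double[]>();
        for (double[] p : points) {
          boolean covered = false;
          for (double[] c : centers) {
            if (d.between(p, c) < t) {  // p falls inside an existing canopy
              covered = true;
              break;
            }
          }
          if (!covered) {
            centers.add(p);             // farther than t from every center: new canopy
          }
        }
        return centers.size();          // larger t covers more points, hence fewer centers
      }
    }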
>> On 5/31/12 12:44 AM, Sean Owen wrote:
>>
>> On Thu, May 31, 2012 at 12:36 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>
>> I see
>>     double denominator = Math.sqrt(lengthSquaredp1) * Math.sqrt(lengthSquaredp2);
>>     // correct for floating-point rounding errors
>>     if (denominator < dotProduct) {
>>       denominator = dotProduct;
>>     }
>>     return 1.0 - dotProduct / denominator;
>>
>> So this is going to return 1 - cosine, right? So for clustering the
>> distance 1 = very close, 0 = very far.
>>
>>
>>  When two vectors are close, the angle between them is small, so the cosine
>> is large, near 1, and the distance 1 - cosine is small. So 0 = close, 1 = far,
>> as expected.
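
A tiny self-contained check of that behavior, using made-up two-dimensional vectors and the same rounding guard as the snippet Pat quoted:

    public class CosineDistanceCheck {

      static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          na += a[i] * a[i];
          nb += b[i] * b[i];
        }
        double denominator = Math.sqrt(na) * Math.sqrt(nb);
        if (denominator < dot) {   // guard against floating-point rounding
          denominator = dot;
        }
        return 1.0 - dot / denominator;
      }

      public static void main(String[] args) {
        double[] x = {1.0, 0.0};
        System.out.println(cosineDistance(x, new double[] {1.0, 1.0})); // ~0.293: similar direction, small distance
        System.out.println(cosineDistance(x, new double[] {0.0, 1.0})); // 1.0: orthogonal, maximum distance
      }
    }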


-- 
Lance Norskog
goksron@gmail.com
