mahout-user mailing list archives

From Dan Brickley <dan...@danbri.org>
Subject 'bin/mahout rowid' failing
Date Sat, 08 Oct 2011 19:43:56 GMT
>> I'm trying to get my book code sparse vectors into a form that can be
>> usefully SVD'd, now that I have made some successful / plausible
>> clusters using those vectors. I think I need first to transpose them
>> so my columns correspond to records/books not their subject codes, but
>> the transpose job complained with type errors, and searching on those
>> led me to discover the 'rowid' task, which I believe I need to use
>> before I can transpose my matrix. So I seem to be stuck. Is rowid the
>> thing to be using here?

On 7 October 2011 17:36, Ted Dunning <ted.dunning@gmail.com> wrote:
> Actually, if you are clustering books, the books should be rows.

So for the clustering, they were. And simple kmeans was quite rewarding: I
asked for 100000 books to be put into 1000 clusters, and the clusterdump
summaries showed very plausible groupings of similar terms. Per the
previous thread, I chose to (ab)use the seq2sparse utility and treat my
phrase-based concept codes as words, by regex'ing them into
underscore_based_atoms:
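For concreteness, the regex munging I mean looks roughly like this — a minimal sketch only; the actual concept-code format and the exact substitutions in my script differ in detail:

```python
import re

def code_to_atom(code):
    """Collapse a phrase-based concept code into a single
    underscore_based_atom so seq2sparse treats it as one word.
    (Illustrative only -- not the real preprocessing script.)"""
    atom = code.strip().lower()
    # Any run of non-alphanumeric characters becomes one underscore.
    atom = re.sub(r'[^a-z0-9]+', '_', atom)
    return atom.strip('_')

print(code_to_atom('Painting, Chinese -- Ming-Qing dynasties'))
# -> 'painting_chinese_ming_qing_dynasties'
```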

Examples -

       Top Terms:
               qing_dynasties_1368                     => 9.295866330464682
               1912                                    => 5.948517057630751
               painting_chinese_ming                   => 3.3054930369059243
               1912_exhibitions                        => 2.6730045389246055
               art_chinese_ming                        => 2.2829231686062283
               wood                                    => 1.8342399243955259
               engraving_chinese_ming                  => 1.571796558521412
               yuan_dynasties_960                      => 1.366419615568938
               1368                                    => 1.0365213818020291
               porcelain_chinese_ming                  => 0.9943387420089157

       Top Terms:
               consciousness_physiology                => 11.120451927185059
               buddhism_psychology                     => 9.916479110717773
               religion_and_medicine                   => 9.08357048034668
               neurosciences                           => 8.960968017578125
               buddhism                                => 8.34786319732666

       Top Terms:
               human_evolution                         => 7.849616527557373
               social_evolution                        => 1.6036341407082297
               sociobiology                            => 1.037030653520064
               biological_evolution                    => 1.006914080995502
               fossil_hominids                         => 0.8849234147505327
               language_and_languages_origin           => 0.7156421198989406
               primates_evolution                      => 0.6145225871693004
               human_behavior                          => 0.5364708177971117
               intellect                               => 0.5072070613051906
               human_biology                           => 0.4739684191617099


(this is getting off the topic of the original Subject: here, but bear with
me, maybe it's useful Mahout semi-newbie usability feedback?)

...so, flushed with pseudo-success, I thought it was time to have another
look at SVD. By this time I'd got in the habit of poking at Mahout's
previously mysterious binary files using the dump utilities, which helped
make things a bit less confusing.

So, I do svd against the same representation that was kmeans'd above:

mahout svd --cleansvd 1 --rank 100 --input
sparse3/tfidf-vectors/part-r-00000 --output svdout/  --numRows 100000
--numCols 27684

...then ran the cleaning step separately (I guess '1' doesn't work for
--cleansvd; the help text doesn't say), then 'mahout seqdumper --seqFile
cleanEigenvectors'

...and I get, as expected, a few fewer than 100 due to the cleaning (88). Each
of these has 27683 values, which is the number of topic codes in my data.

I'm reading this (correct me if I have it backwards) as my topics now being
data points positioned in a new, compressed version of a 'book space'. What I
was after instead was 100000 books in a new lower-dimensional 'topic space'
(can I say this as: I want left singular vectors but I'm getting right
singular vectors?). Hence the attempt to transpose and rerun Lanczos; I
thought this the conceptually simplest, if not the most efficient, way to get
there. I understand there are other routes but expected this one to work.
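(For anyone following along, the left/right distinction I mean can be sketched in NumPy — purely illustrative, nothing Mahout-specific: the left singular vectors of a books-x-topics matrix A turn up as the right singular vectors of A transposed, which is why transposing first should get me books-in-topic-space.)

```python
import numpy as np

# Toy books-x-topics matrix: 6 "books", 4 "topic codes".
rng = np.random.default_rng(0)
A = rng.random((6, 4))

# SVD of A: rows of U place the books in the reduced space,
# rows of Vt place the topics.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# SVD of A transposed: the roles swap, so A's left singular
# vectors reappear as the right singular vectors of A.T.
U2, s2, Vt2 = np.linalg.svd(A.T, full_matrices=False)

# Same singular values, and U matches Vt2.T up to a sign
# flip per component.
print(np.allclose(s, s2))                     # True
print(np.allclose(np.abs(U), np.abs(Vt2.T)))  # True
```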

Is bin/mahout rowid failing only for me?

Dan
