mahout-user mailing list archives

From Stefan Wienert <ste...@wienert.cc>
Subject Re: Need a little help with SVD / Dimensional Reduction
Date Tue, 07 Jun 2011 20:01:27 GMT
Before I rewrite my program: is there any advantage over the Lanczos SVD?

2011/6/7 Dmitriy Lyubimov <dlieu.7@gmail.com>:
> I am saying I did not test it with 0.20.2.
>
> Yes, it is integrated in the 0.5 release, but there might be problems with
> hadoop 0.20.2.
>
> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>> Hmm... looks nice...
>>
>> So there is a Lanczos implementation of SVD and the stochastic version
>> of SVD. Both produce the doc-concept vectors that I need.
>> I actually get my TFIDF vectors directly from a Lucene index (and have
>> to do some magic to get IntWritable, VectorWritable).
>>
>> Still, I do not exactly understand what you are trying to say. Your SSVD
>> does not run with mahout but on CDH (what is that, btw)? Or is it
>> available for mahout? So what has to be modified to run it with
>> mahout?
>>
>> Thanks
>> Stefan
>>
>>
>>
>> 2011/6/7 Dmitriy Lyubimov <dlieu.7@gmail.com>:
>>> I also do LSI/LSA and wrote a stochastic svd that is capable of
>>> producing exact U, V and Sigma,
>>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.
>>>
>>> On top of that, it is adjusted to be LSI-friendly if your keys are not
>>> necessarily DoubleWritable (you can still keep your original document
>>> paths or whatever they are identified with); that info is migrated into
>>> the keys of the U matrix output. So you can run it directly on the
>>> output of Mahout's seq2sparse (I actually tested that on the Reuters
>>> dataset example from Mahout in Action).
>>>
>>> The computation has some stochastic noise to it, but LSI problems are
>>> never exact problems anyway, and their results are subject to corpus
>>> selection.
>>>
>>> If you are interested in helping me work the kinks out in Mahout's
>>> version, I'd be grateful, since I don't run the method on 0.20.2 but on
>>> CDH in a customized way. But be warned that it may require a number of
>>> patches before it works for you.
>>>
>>> Here is a (little bit too wordy) command-line manual for Mahout 0.5:
>>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html
>>>
>>> Thanks.
>>>
>>> -D
>>>
>>>
>>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>>>> Yeah :) That's what I assumed. So this problem is solved. Thanks!
>>>>
>>>> And now your questions:
>>>>
>>>> I want to do LSA/LSI with Mahout. Therefore I take the TFIDF
>>>> Document-Term matrix and reduce it using Lanczos. So I get the
>>>> Document-Concept vectors. These vectors will be compared with cosine
>>>> similarity to find similar documents. I can also cluster these
>>>> documents with k-means...
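The comparison step described above can be sketched in a few lines of plain Python; the 3-component vectors below are made-up stand-ins for the rank-100 document-concept vectors a real run would produce.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two concept vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional document-concept vectors (real runs
# would use the rank-100 rows produced by the Lanczos step).
doc_a = [0.9, 0.1, 0.0]
doc_b = [0.8, 0.2, 0.1]
doc_c = [0.0, 0.1, 0.9]

print(cosine_similarity(doc_a, doc_b))  # close to 1: similar documents
print(cosine_similarity(doc_a, doc_c))  # close to 0: dissimilar documents
```

The same similarity function is what a k-means pass over the concept vectors would effectively rely on.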
>>>>
>>>> If you have any suggestions, feel free to tell me. I'm also interested
>>>> in other document-similarity-techniques.
>>>>
>>>> Cheers,
>>>> Stefan
>>>>
>>>> 2011/6/6 Danny Bickson <danny.bickson@gmail.com>:
>>>>> If I understand your question correctly, you need to simply transpose M
>>>>> before you start the run and that way
>>>>> you will get the other singular vectors.
>>>>>
>>>>> May I ask what is the problem you are working on and why do you need the
>>>>> singular vectors?
>>>>> Can you consider using another matrix decomposition technique, for example
>>>>> alternating least squares,
>>>>> which gives you two lower-rank matrices that approximate the large decomposed
>>>>> matrix?
>>>>>
>>>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>>>>>
>>>>>> Hi Danny!
>>>>>>
>>>>>> I understand that for M*M' (and for M'*M) the left and right
>>>>>> eigenvectors are identical. But that is not exactly what I want. The
>>>>>> lanczos solver from mahout gives me the eigenvectors of M*M', which
>>>>>> are the left singular vectors of M. But I need the right singular
>>>>>> vectors of M (and not M*M'). How do I get them?
>>>>>>
>>>>>> Sorry, my matrix math is not as good as it should be, but I hope you
>>>>>> can help me!
>>>>>>
>>>>>> Thanks,
>>>>>> Stefan
>>>>>>
>>>>>> 2011/6/6 Danny Bickson <danny.bickson@gmail.com>:
>>>>>> > Hi Stefan!
>>>>>> > For a positive semidefinite matrix, the left and right eigenvectors are
>>>>>> > identical.
>>>>>> > See the SVD Wikipedia article: when M is also positive
>>>>>> > semi-definite <http://en.wikipedia.org/wiki/Positive-definite_matrix>,
>>>>>> > the decomposition M = U D U* is also a singular value decomposition.
>>>>>> > So you don't need to be worried about the other singular vectors.
>>>>>> >
>>>>>> > Hope this helps!
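The claim above is easy to check numerically. A minimal pure-Python sketch, using a hand-picked 2x2 PSD matrix of my own (not from the thread) whose eigenvectors are known in closed form:

```python
import math

# A small symmetric positive semi-definite matrix (my own example).
A = [[2.0, 1.0],
     [1.0, 2.0]]

s = 1.0 / math.sqrt(2.0)
# Its orthonormal eigenvectors (columns of U) and eigenvalues.
U = [[s,  s],
     [s, -s]]
D = [3.0, 1.0]  # both non-negative, because A is PSD

def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

# Reconstruct U * D * U' and compare with A: for a PSD matrix this
# eigendecomposition doubles as an SVD (singular values = eigenvalues).
UD = [[U[i][j] * D[j] for j in range(2)] for i in range(2)]
R = mat_mul(UD, transpose(U))

for i in range(2):
    for j in range(2):
        assert abs(R[i][j] - A[i][j]) < 1e-12
print("U D U' reconstructs A; eigen- and singular vectors coincide")
```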
>>>>>> >
>>>>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>>>>>> >
>>>>>> >> Hi.
>>>>>> >>
>>>>>> >> Thanks for the help.
>>>>>> >>
>>>>>> >> The important points from wikipedia are:
>>>>>> >> - The left singular vectors of M are eigenvectors of M*M'.
>>>>>> >> - The right singular vectors of M are eigenvectors of M'*M.
>>>>>> >>
>>>>>> >> As you describe, the mahout lanczos solver calculates A=M'*M (I think
>>>>>> >> it does A=M*M', but that is not a problem). Therefore it already
>>>>>> >> calculates the right (or left) singular vectors of M.
>>>>>> >>
>>>>>> >> But my question is, how can I get the other singular vectors? I can
>>>>>> >> transpose M, but then I have to calculate two SVDs, one for the right
>>>>>> >> and one for the left singular vectors... I think there is a better way
>>>>>> >> :)
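There is indeed a better way: from M = U S V' it follows that V = M' U S^-1, so a single SVD suffices. A pure-Python sanity check on a toy 2x2 matrix of my own construction (this is the textbook identity, not Mahout's API):

```python
import math

def rot(t):
    """2x2 rotation matrix: a convenient source of orthonormal columns."""
    return [[math.cos(t), -math.sin(t)],
            [math.sin(t),  math.cos(t)]]

def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

# Build a matrix with a known SVD, M = U * S * V' (toy 2x2 example).
U = rot(0.3)
V = rot(1.1)
S = [3.0, 1.0]
US = [[U[i][j] * S[j] for j in range(2)] for i in range(2)]
M = mat_mul(US, transpose(V))

# One SVD suffices: given M and the left factors, V = M' * U * S^-1,
# because M' U = V S U' U = V S.
US_inv = [[U[i][j] / S[j] for j in range(2)] for i in range(2)]
V_rec = mat_mul(transpose(M), US_inv)

for i in range(2):
    for j in range(2):
        assert abs(V_rec[i][j] - V[i][j]) < 1e-12
print("recovered V without a second SVD")
```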
>>>>>> >>
>>>>>> >> Hope you can help me with this...
>>>>>> >> Thanks
>>>>>> >> Stefan
>>>>>> >>
>>>>>> >>
>>>>>> >> 2011/6/6 Danny Bickson <danny.bickson@gmail.com>:
>>>>>> >> > Hi
>>>>>> >> > Mahout's SVD implementation computes the Lanczos iteration:
>>>>>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>>>>>> >> > Denote the non-square input matrix as M. First a symmetric matrix A is
>>>>>> >> > computed as A=M'*M.
>>>>>> >> > Then an approximating tridiagonal matrix T and a vector matrix V are
>>>>>> >> > computed such that A =~ V*T*V'
>>>>>> >> > (this process is done in a distributed way).
>>>>>> >> >
>>>>>> >> > Next, the matrix T is decomposed into eigenvectors and eigenvalues,
>>>>>> >> > which is the returned result. (This process is serial.)
>>>>>> >> >
>>>>>> >> > The third step makes the returned eigenvectors orthogonal to each other
>>>>>> >> > (which is optional IMHO).
>>>>>> >> >
>>>>>> >> > The heart of the code is found at:
>>>>>> >> > ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>>>>>> >> > At least that is where it was in version 0.4; I am not sure if there
>>>>>> >> > are changes in version 0.5.
>>>>>> >> >
>>>>>> >> > Anyway, Mahout does not compute the SVD directly. If you are interested
>>>>>> >> > in learning more about the relation to SVD,
>>>>>> >> > look at http://en.wikipedia.org/wiki/Singular_value_decomposition,
>>>>>> >> > subsection "Relation to eigenvalue decomposition".
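The recurrence sketched above can be written serially in pure Python. The tiny symmetric matrix is my own example; a full-rank run (n steps, no breakdown) reconstructs A = V*T*V' exactly, which is the property the distributed job approximates at low rank:

```python
import math

# Plain Lanczos tridiagonalization of a small symmetric matrix,
# run to full rank so that A = V * T * V' exactly (toy example).
A = [[4.0, 1.0, 1.0],
     [1.0, 3.0, 1.0],
     [1.0, 1.0, 2.0]]
n = 3

def matvec(X, v):
    return [sum(X[i][k] * v[k] for k in range(n)) for i in range(n)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def axpy(c, x, y):  # y + c*x
    return [yi + c * xi for xi, yi in zip(x, y)]

# Lanczos recurrence: alpha goes on the diagonal of T, beta off-diagonal.
V = [[1.0, 0.0, 0.0]]          # starting unit vector v1
alpha, beta = [], []
w = matvec(A, V[0])
alpha.append(dot(w, V[0]))
w = axpy(-alpha[0], V[0], w)
for j in range(1, n):
    b = math.sqrt(dot(w, w))
    beta.append(b)
    V.append([x / b for x in w])
    w = matvec(A, V[j])
    a_j = dot(w, V[j])
    alpha.append(a_j)
    w = axpy(-a_j, V[j], w)
    w = axpy(-beta[j - 1], V[j - 1], w)

# Assemble the tridiagonal T and rebuild A from the Lanczos vectors.
T = [[0.0] * n for _ in range(n)]
for j in range(n):
    T[j][j] = alpha[j]
for j in range(n - 1):
    T[j][j + 1] = T[j + 1][j] = beta[j]

# Rows of the V list are the Lanczos vectors (columns of the V matrix).
recon = [[sum(V[p][i] * T[p][q] * V[q][j] for p in range(n) for q in range(n))
          for j in range(n)] for i in range(n)]
for i in range(n):
    for j in range(n):
        assert abs(recon[i][j] - A[i][j]) < 1e-9
print("A == V T V' after a full Lanczos pass")
```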
>>>>>> >> >
>>>>>> >> > Hope this helps,
>>>>>> >> >
>>>>>> >> > Danny Bickson
>>>>>> >> >
>>>>>> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <stefan@wienert.cc> wrote:
>>>>>> >> >
>>>>>> >> >> After reading this thread:
>>>>>> >> >>
>>>>>> >> >> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>>>>>> >> >>
>>>>>> >> >> Wiki-SVD: M = U S V* (* = transposed)
>>>>>> >> >>
>>>>>> >> >> The output of Mahout-SVD is (U S) right?
>>>>>> >> >>
>>>>>> >> >> So... How do I get V from (U S)  and M?
>>>>>> >> >>
>>>>>> >> >> Is V = M (U S)* (because this is what the calculation in the example
>>>>>> >> >> is)?
>>>>>> >> >>
>>>>>> >> >> Thanks
>>>>>> >> >> Stefan
>>>>>> >> >>
>>>>>> >> >> 2011/6/6 Stefan Wienert <stefan@wienert.cc>:
>>>>>> >> >> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>> >> >> >
>>>>>> >> >> > What is done:
>>>>>> >> >> >
>>>>>> >> >> > Input:
>>>>>> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
>>>>>> >> >> >
>>>>>> >> >> > "SVD" of the tf-idf-matrix (rank 100) produces the eigenvectors
>>>>>> >> >> > (and eigenvalues) of the tf-idf-matrix, called:
>>>>>> >> >> > svd (concepts x terms) 87 x 20444
>>>>>> >> >> >
>>>>>> >> >> > transpose tf-idf-matrix:
>>>>>> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>>>>>> >> >> >
>>>>>> >> >> > transpose svd:
>>>>>> >> >> > svd-transpose (terms x concepts) 20444 x 87
>>>>>> >> >> >
>>>>>> >> >> > matrix multiply:
>>>>>> >> >> > tf-idf-matrix-transpose x svd-transpose = result
>>>>>> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
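The shape bookkeeping in the multiply step can be checked on shrunken toy dimensions. This assumes the job pairs its two inputs along the shared terms dimension, i.e. effectively computes A'*B, which is what the matching 20444-row arguments suggest:

```python
# Toy shapes standing in for docs=6,076,937, terms=20,444, concepts=87.
docs, terms, concepts = 6, 5, 2

A = [[0.0] * docs for _ in range(terms)]      # tf-idf-matrix-transpose: terms x docs
B = [[0.0] * concepts for _ in range(terms)]  # svd-transpose: terms x concepts

def transpose_times(X, Y):
    """Compute X' * Y without materializing X' (the implied pairing)."""
    return [[sum(X[k][i] * Y[k][j] for k in range(len(X)))
             for j in range(len(Y[0]))] for i in range(len(X[0]))]

R = transpose_times(A, B)
# Result shape: docs x concepts, one concept vector per document.
assert (len(R), len(R[0])) == (docs, concepts)
```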
>>>>>> >> >> >
>>>>>> >> >> > so... I do understand that the "svd" here is not the SVD from
>>>>>> >> >> > wikipedia. It only does the Lanczos algorithm and some magic which
>>>>>> >> >> > produces the
>>>>>> >> >> >> Instead either the left or right (but usually the right)
>>>>>> >> >> >> eigenvectors premultiplied by the diagonal or the square root of the
>>>>>> >> >> >> diagonal element.
>>>>>> >> >> > from
>>>>>> >> >> > http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>>>>>> >> >> >
>>>>>> >> >> > so my question: what is the output of the SVD in mahout? And what do I
>>>>>> >> >> > have to calculate to get the "right singular vectors" from svd?
>>>>>> >> >> >
>>>>>> >> >> > Thanks,
>>>>>> >> >> > Stefan
>>>>>> >> >> >
>>>>>> >> >> > 2011/6/6 Stefan Wienert <stefan@wienert.cc>:
>>>>>> >> >> >>
>>>>>> >> >> >> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>> >> >> >>
>>>>>> >> >> >> the last step is the matrix multiplication:
>>>>>> >> >> >>  --arg --numRowsA --arg 20444 \
>>>>>> >> >> >>  --arg --numColsA --arg 6076937 \
>>>>>> >> >> >>  --arg --numRowsB --arg 20444 \
>>>>>> >> >> >>  --arg --numColsB --arg 87 \
>>>>>> >> >> >> so the result is a 6,076,937 x 87 matrix
>>>>>> >> >> >>
>>>>>> >> >> >> the input has 6,076,937 documents (each with 20,444 terms), so the
>>>>>> >> >> >> result of the matrix multiplication has to be the right singular
>>>>>> >> >> >> vectors, judging by the dimensions.
>>>>>> >> >> >>
>>>>>> >> >> >> so the result is the "concept-document vector matrix" (I think
>>>>>> >> >> >> this is also called "document vectors"?)
>>>>>> >> >> >>
>>>>>> >> >> >> 2011/6/6 Ted Dunning <ted.dunning@gmail.com>:
>>>>>> >> >> >>> Yes.  These are term vectors, not document vectors.
>>>>>> >> >> >>>
>>>>>> >> >> >>> There is an additional step that can be run to produce document vectors.
>>>>>> >> >> >>>
>>>>>> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>>>>>> >> >> >>>
>>>>>> >> >> >>>> compared to SVD, is the result the "right singular vectors"?
>>>>>> >> >> >>>>
>>>>>> >> >> >>>
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >> --
>>>>>> >> >> >> Stefan Wienert
>>>>>> >> >> >>
>>>>>> >> >> >> http://www.wienert.cc
>>>>>> >> >> >> stefan@wienert.cc
>>>>>> >> >> >>
>>>>>> >> >> >> Telefon: +495251-2026838
>>>>>> >> >> >> Mobil: +49176-40170270
>>>>>> >> >> >>
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > --
>>>>>> >> >> > Stefan Wienert
>>>>>> >> >> >
>>>>>> >> >> > http://www.wienert.cc
>>>>>> >> >> > stefan@wienert.cc
>>>>>> >> >> >
>>>>>> >> >> > Telefon: +495251-2026838
>>>>>> >> >> > Mobil: +49176-40170270
>>>>>> >> >> >
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> --
>>>>>> >> >> Stefan Wienert
>>>>>> >> >>
>>>>>> >> >> http://www.wienert.cc
>>>>>> >> >> stefan@wienert.cc
>>>>>> >> >>
>>>>>> >> >> Telefon: +495251-2026838
>>>>>> >> >> Mobil: +49176-40170270
>>>>>> >> >>
>>>>>> >> >
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> --
>>>>>> >> Stefan Wienert
>>>>>> >>
>>>>>> >> http://www.wienert.cc
>>>>>> >> stefan@wienert.cc
>>>>>> >>
>>>>>> >> Telefon: +495251-2026838
>>>>>> >> Mobil: +49176-40170270
>>>>>> >>
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Stefan Wienert
>>>>>>
>>>>>> http://www.wienert.cc
>>>>>> stefan@wienert.cc
>>>>>>
>>>>>> Telefon: +495251-2026838
>>>>>> Mobil: +49176-40170270
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Stefan Wienert
>>>>
>>>> http://www.wienert.cc
>>>> stefan@wienert.cc
>>>>
>>>> Telefon: +495251-2026838
>>>> Mobil: +49176-40170270
>>>>
>>>
>>
>>
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> stefan@wienert.cc
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>>
>



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270
