mahout-user mailing list archives

From Stefan Wienert <ste...@wienert.cc>
Subject Re: Need a little help with SVD / Dimensional Reduction
Date Tue, 07 Jun 2011 19:55:45 GMT
Hmm... looks nice...

So there is a Lanczos implementation of SVD and the stochastic version
of SVD. Both produce the doc-concept vectors that I need.
I actually get my TF-IDF vectors directly from a Lucene index (and have
to do some magic to get them into IntWritable, VectorWritable pairs).

Still, I do not exactly understand what you are trying to say. Your SSVD
does not run with Mahout but on CDH (what is that, btw?)? Or is it
available for Mahout? So what has to be modified to run it with
Mahout?

Thanks
Stefan



2011/6/7 Dmitriy Lyubimov <dlieu.7@gmail.com>:
> I also do LSI/LSA and wrote a stochastic SVD that is capable of
> producing exact U, V and Sigma,
> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.
>
> On top of that, it is adjusted to be LSI-friendly if your keys are not
> necessarily DoubleWritable (you can still keep your original document
> paths or whatever they are identified with); that info is migrated as
> the keys of the U matrix output. So you can run it directly on the
> output of Mahout's seq2sparse (I actually tested that on the Reuters
> dataset example from Mahout in Action).
>
> The computation has some stochastic noise to it, but LSI problems are
> never exact problems anyway, and their results are subject to corpus
> selection.
>
> If you are interested in helping me work the kinks out in the Mahout
> version, I'd be grateful, since I don't run the method on 0.20.2 but on
> CDH in a customized way. But be warned that it may require a number of
> patches before it works for you.
>
> Here is a (slightly too wordy) command-line manual for Mahout 0.5:
> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html
>
> Thanks.
>
> -D
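For a feel of what the stochastic SVD does, its core randomized range-finding step can be sketched in plain Python. This is not Mahout's implementation; the rank-2 toy matrix, the seed, and the helper functions below are made up purely for illustration.

```python
import math
import random

random.seed(42)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

# A rank-2 "corpus" matrix (6 docs x 5 terms), built from two outer products.
u1, u2 = [1, 2, 0, 1, 3, 1], [0, 1, 1, 2, 0, 1]
v1, v2 = [2, 0, 1, 1, 0], [1, 1, 0, 0, 2]
M = [[float(u1[i] * v1[j] + u2[i] * v2[j]) for j in range(5)] for i in range(6)]

# Random projection Y = M * Omega compresses the column space of M.
Omega = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(5)]
Y = matmul(M, Omega)

# Gram-Schmidt on the columns of Y yields an orthonormal basis Q for range(M).
Q_cols = []
for c in transpose(Y):
    w = list(c)
    for q in Q_cols:
        d = sum(wi * qi for wi, qi in zip(w, q))
        w = [wi - d * qi for wi, qi in zip(w, q)]
    n = math.sqrt(sum(wi * wi for wi in w))
    if n > 1e-10:          # drop directions already captured (M has rank 2)
        Q_cols.append([wi / n for wi in w])
Q = transpose(Q_cols)

# Because M is low-rank, projecting back loses (almost) nothing: Q Q' M ~= M.
R = matmul(Q, matmul(transpose(Q), M))
err = max(abs(R[i][j] - M[i][j]) for i in range(6) for j in range(5))
print(err)  # roundoff-sized
```

In the full algorithm the small matrix Q'M is then decomposed and its singular vectors are mapped back through Q; the random projection is the source of the "stochastic noise" mentioned above.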
>
>
> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>> Yeah :) That's what I assumed. So this problem is solved. Thanks!
>>
>> And now your questions:
>>
>> I want to do LSA/LSI with Mahout. So I take the TF-IDF
>> document-term matrix and reduce it using Lanczos, which gives me the
>> document-concept vectors. These vectors are then compared with cosine
>> similarity to find similar documents. I can also cluster these
>> documents with k-means...
>>
>> If you have any suggestions, feel free to tell me. I'm also interested
>> in other document-similarity-techniques.
>>
>> Cheers,
>> Stefan
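The comparison step described above can be sketched in plain Python; the three-component "concept vectors" below are hypothetical stand-ins for the reduced document vectors, not data from this thread.

```python
import math

# Cosine similarity between two reduced document-concept vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical rank-3 concept vectors for three documents.
doc_a = [0.9, 0.1, 0.4]
doc_b = [0.8, 0.2, 0.5]
doc_c = [-0.1, 0.9, 0.0]

print(cosine(doc_a, doc_b))  # close to 1: similar documents
print(cosine(doc_a, doc_c))  # close to 0: unrelated documents
```

Because cosine similarity ignores vector length, it compares documents by direction in concept space, which is why it pairs naturally with LSI-reduced vectors.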
>>
>> 2011/6/6 Danny Bickson <danny.bickson@gmail.com>:
>>> If I understand your question correctly, you need to simply transpose M
>>> before you start the run and that way
>>> you will get the other singular vectors.
>>>
>>> May I ask what problem you are working on and why you need the
>>> singular vectors?
>>> Could you consider using another matrix decomposition technique, for
>>> example alternating least squares, which gives you two lower-rank
>>> matrices that approximate the large decomposed matrix?
>>>
>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>>>
>>>> Hi Danny!
>>>>
>>>> I understand that for M*M' (and for M'*M) the left and right
>>>> eigenvectors are identical. But that is not exactly what I want. The
>>>> Lanczos solver from Mahout gives me the eigenvectors of M*M', which
>>>> are the left singular vectors of M. But I need the right singular
>>>> vectors of M (and not of M*M'). How do I get them?
>>>>
>>>> Sorry, my matrix math is not as good as it should be, but I hope you
>>>> can help me!
>>>>
>>>> Thanks,
>>>> Stefan
>>>>
>>>> 2011/6/6 Danny Bickson <danny.bickson@gmail.com>:
>>>> > Hi Stefan!
>>>> > For a positive semidefinite matrix, the left and right singular
>>>> > vectors are identical.
>>>> > See the SVD article on Wikipedia: when M is also positive semi-definite
>>>> > (http://en.wikipedia.org/wiki/Positive-definite_matrix), the
>>>> > decomposition M = U D U* is also a singular value decomposition.
>>>> > So you don't need to worry about the other singular vectors.
>>>> >
>>>> > Hope this helps!
>>>> >
>>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>>>> >
>>>> >> Hi.
>>>> >>
>>>> >> Thanks for the help.
>>>> >>
>>>> >> The important points from Wikipedia are:
>>>> >> - The left singular vectors of M are eigenvectors of M*M'.
>>>> >> - The right singular vectors of M are eigenvectors of M'*M.
>>>> >>
>>>> >> As you describe, the Mahout Lanczos solver calculates A=M'*M (I think
>>>> >> it does A=M*M', but that is not a problem). Therefore it already
>>>> >> calculates the right (or left) singular vectors of M.
>>>> >>
>>>> >> But my question is: how can I get the other singular vectors? I could
>>>> >> transpose M, but then I would have to calculate two SVDs, one for the
>>>> >> right and one for the left singular vectors... I think there is a
>>>> >> better way :)
>>>> >>
>>>> >> Hope you can help me with this...
>>>> >> Thanks
>>>> >> Stefan
>>>> >>
>>>> >>
>>>> >> 2011/6/6 Danny Bickson <danny.bickson@gmail.com>:
>>>> >> > Hi
>>>> >> > Mahout's SVD implementation computes the Lanczos iteration:
>>>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>>>> >> > Denote the non-square input matrix as M. First a symmetric matrix A
>>>> >> > is computed as A=M'*M.
>>>> >> > Then an approximating tridiagonal matrix T and a matrix of vectors V
>>>> >> > are computed such that A =~ V*T*V'
>>>> >> > (this process is done in a distributed way).
>>>> >> >
>>>> >> > Next, the matrix T is decomposed into eigenvectors and eigenvalues,
>>>> >> > which is the returned result. (This process is serial.)
>>>> >> >
>>>> >> > The third step makes the returned eigenvectors orthogonal to each
>>>> >> > other (which is optional, IMHO).
>>>> >> >
>>>> >> > The heart of the code is found at:
>>>> >> > ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>>>> >> > At least that is where it was in version 0.4; I am not sure if there
>>>> >> > are changes in version 0.5.
>>>> >> >
>>>> >> > Anyway, Mahout does not compute the SVD directly. If you are
>>>> >> > interested in learning more about the relation to SVD, look at
>>>> >> > http://en.wikipedia.org/wiki/Singular_value_decomposition,
>>>> >> > subsection "Relation to eigenvalue decomposition".
>>>> >> >
>>>> >> > Hope this helps,
>>>> >> >
>>>> >> > Danny Bickson
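The Lanczos step Danny describes can be illustrated with a small self-contained sketch. A toy symmetric matrix stands in for A = M'M, and full reorthogonalization mirrors the optional third step; this is not the distributed Mahout code, and the matrix and seed are made up for illustration.

```python
import math
import random

# Toy symmetric A standing in for M'M (illustrative only).
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]
n = len(A)
k = n  # run to full rank so A = V T V' up to roundoff

random.seed(7)
v0 = [random.gauss(0.0, 1.0) for _ in range(n)]
nv = math.sqrt(sum(x * x for x in v0))
vs = [[x / nv for x in v0]]     # Lanczos vectors
alphas, betas = [], []
beta = 0.0

for j in range(k):
    w = [sum(A[i][l] * vs[j][l] for l in range(n)) for i in range(n)]  # A v_j
    alpha = sum(w[i] * vs[j][i] for i in range(n))
    alphas.append(alpha)
    # three-term recurrence: subtract components along v_j and v_{j-1}
    w = [w[i] - alpha * vs[j][i] - (beta * vs[j - 1][i] if j > 0 else 0.0)
         for i in range(n)]
    # full reorthogonalization, as in the optional third step
    for q in vs:
        d = sum(w[i] * q[i] for i in range(n))
        w = [w[i] - d * q[i] for i in range(n)]
    beta = math.sqrt(sum(x * x for x in w))
    if j < k - 1:
        betas.append(beta)
        vs.append([x / beta for x in w])

# Rebuild the tridiagonal T and check A ~= V T V'.
T = [[0.0] * k for _ in range(k)]
for j in range(k):
    T[j][j] = alphas[j]
for j in range(k - 1):
    T[j][j + 1] = T[j + 1][j] = betas[j]

def matmul(X, Y):
    return [[sum(X[i][m] * Y[m][j] for m in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

V = [[vs[j][i] for j in range(k)] for i in range(n)]   # columns are v_j
Vt = [[V[i][j] for i in range(n)] for j in range(k)]
R = matmul(matmul(V, T), Vt)
err = max(abs(R[i][j] - A[i][j]) for i in range(n) for j in range(n))
print(err)  # roundoff-sized
```

In practice the solver runs far fewer than n iterations, so T is a small approximation whose eigenpairs approximate the extreme eigenpairs of A.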
>>>> >> >
>>>> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <stefan@wienert.cc> wrote:
>>>> >> >
>>>> >> >> After reading this thread:
>>>> >> >>
>>>> >> >>
>>>> >> >> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>>>> >> >>
>>>> >> >> Wiki-SVD: M = U S V* (* = transposed)
>>>> >> >>
>>>> >> >> The output of Mahout-SVD is (U S), right?
>>>> >> >>
>>>> >> >> So... how do I get V from (U S) and M?
>>>> >> >>
>>>> >> >> Is V = M (U S)* (because this is what the calculation in the
>>>> >> >> example is)?
>>>> >> >>
>>>> >> >> Thanks
>>>> >> >> Stefan
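The relation being asked about can be checked on a toy matrix. A minimal sketch, assuming a 2x2 M whose M'M has a closed-form eigendecomposition (the numbers are illustrative only): the eigenvectors of M'M give V, the singular values are square roots of the eigenvalues, and the other factor follows without a second SVD as U = M V Sigma^-1.

```python
import math

# Toy matrix M (2 docs x 2 terms), chosen so M'M is easy by hand.
M = [[3.0, 0.0],
     [4.0, 5.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

# M'M = [[25, 20], [20, 25]] has eigenvalues 45 and 5 with eigenvectors
# (1,1)/sqrt(2) and (1,-1)/sqrt(2): these are the right singular vectors.
s = 1.0 / math.sqrt(2.0)
V = [[s,  s],
     [s, -s]]                      # columns are the right singular vectors
sigma = [math.sqrt(45.0), math.sqrt(5.0)]

# The other set follows without a second SVD: U = M V Sigma^-1.
MV = matmul(M, V)
U = [[MV[i][j] / sigma[j] for j in range(2)] for i in range(2)]

# Sanity check: U Sigma V' reconstructs M.
US = [[U[i][j] * sigma[j] for j in range(2)] for i in range(2)]
R = matmul(US, transpose(V))
print(R)  # recovers [[3, 0], [4, 5]] up to roundoff
```

The same identity read the other way, V = M' U Sigma^-1, is what converts one set of singular vectors into the other, which is the trick this thread is circling around.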
>>>> >> >>
>>>> >> >> 2011/6/6 Stefan Wienert <stefan@wienert.cc>:
>>>> >> >> >
>>>> >> >> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>> >> >> >
>>>> >> >> > What is done:
>>>> >> >> >
>>>> >> >> > Input:
>>>> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
>>>> >> >> >
>>>> >> >> > "SVD" of the tf-idf-matrix (rank 100) produces the eigenvectors
>>>> >> >> > (and eigenvalues) of the tf-idf-matrix, called:
>>>> >> >> > svd (concepts x terms) 87 x 20444
>>>> >> >> >
>>>> >> >> > transpose tf-idf-matrix:
>>>> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>>>> >> >> >
>>>> >> >> > transpose svd:
>>>> >> >> > svd-transpose (terms x concepts) 20444 x 87
>>>> >> >> >
>>>> >> >> > matrix multiply:
>>>> >> >> > tf-idf-matrix-transpose x svd-transpose = result
>>>> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
>>>> >> >> >
>>>> >> >> > so... I do understand that the "svd" here is not the SVD from
>>>> >> >> > Wikipedia. It only does the Lanczos algorithm and some magic which
>>>> >> >> > produces:
>>>> >> >> >> Instead either the left or right (but usually the right)
>>>> >> >> >> eigenvectors premultiplied by the diagonal or the square root of
>>>> >> >> >> the diagonal element.
>>>> >> >> > from
>>>> >> >> > http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>>>> >> >> >
>>>> >> >> > so my question: what is the output of the SVD in Mahout, and
>>>> >> >> > what do I have to calculate to get the "right singular vectors"
>>>> >> >> > from it?
>>>> >> >> >
>>>> >> >> > Thanks,
>>>> >> >> > Stefan
>>>> >> >> >
>>>> >> >> > 2011/6/6 Stefan Wienert <stefan@wienert.cc>:
>>>> >> >> >>
>>>> >> >> >> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>> >> >> >>
>>>> >> >> >> the last step is the matrix multiplication:
>>>> >> >> >>  --arg --numRowsA --arg 20444 \
>>>> >> >> >>  --arg --numColsA --arg 6076937 \
>>>> >> >> >>  --arg --numRowsB --arg 20444 \
>>>> >> >> >>  --arg --numColsB --arg 87 \
>>>> >> >> >> so the result is a 6,076,937 x 87 matrix
>>>> >> >> >>
>>>> >> >> >> the input has 6,076,937 documents (each with 20,444 terms), so
>>>> >> >> >> judging by the dimensions, the result of the matrix
>>>> >> >> >> multiplication has to be the right singular vectors.
>>>> >> >> >>
>>>> >> >> >> so the result is the "concept-document vector matrix" (I think
>>>> >> >> >> these are also called "document vectors"?)
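The dimensional bookkeeping of that final multiplication can be sketched with toy sizes: 3 docs, 4 terms, and 2 concepts stand in for 6,076,937 x 20,444 x 87, and all the numbers below are made up for illustration.

```python
# Projecting the corpus onto the reduced concept basis:
# (docs x terms) * (terms x concepts) = (docs x concepts).
M = [[1.0, 0.0, 2.0, 1.0],      # docs x terms (toy TF-IDF matrix)
     [0.0, 1.0, 1.0, 0.0],
     [2.0, 1.0, 0.0, 3.0]]
V = [[0.5, 0.1],                # terms x concepts (toy concept basis)
     [0.1, 0.7],
     [0.6, 0.2],
     [0.4, 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

doc_concept = matmul(M, V)      # one reduced vector per document
print(len(doc_concept), len(doc_concept[0]))  # 3 2
```

Each row of the result is a document expressed in concept coordinates, which is exactly the shape (6,076,937 x 87) reported above.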
>>>> >> >> >>
>>>> >> >> >> 2011/6/6 Ted Dunning <ted.dunning@gmail.com>:
>>>> >> >> >>> Yes. These are term vectors, not document vectors.
>>>> >> >> >>>
>>>> >> >> >>> There is an additional step that can be run to produce document
>>>> >> >> >>> vectors.
>>>> >> >> >>>
>>>> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>>>> >> >> >>>
>>>> >> >> >>>> compared to SVD, is the result the "right singular vectors"?
>>>> >> >> >>>>
>>>> >> >> >>>
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>



-- 
Stefan Wienert

http://www.wienert.cc
stefan@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270
