mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Need a little help with SVD / Dimensional Reduction
Date Tue, 07 Jun 2011 19:58:53 GMT
I am saying i did not test it with 0.20.2

Yes it is integrated in 0.5 release but there might be problems with
hadoop 0.20.2

On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <stefan@wienert.cc> wrote:
> Hmm... looks nice...
>
> So there is a Lanczos implementation of SVD and the stochastic version
> of SVD. Both produce the doc-concept vectors that I need.
> I acctualy get my TFIDF Vectors directly from a lucene index (and have
> to do some magic to get IntWritable, VectorWritable).
>
> Still, I do not exactly understand what you trying to say. Your SSVD
> does not run with mahout but on CDH (what is this btw?)? Or is it
> available for mahout? So what has to be modified to run it with
> mahout?
>
> Thanks
> Stefan
>
>
>
> 2011/6/7 Dmitriy Lyubimov <dlieu.7@gmail.com>:
>> I also do LSI/LSA and wrote a stochastic svd that is capable to
>> produce exact U, V and Sigma ,
>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.
>>
>> On top of that, it is adjusted to be LSI-friendly if your keys are not
>> necessarily DoubleWritable (you can still keep your original document
>> paths or whatever they are identified with), that info is migrated as
>> keys of the output of U matrix. So you can run it directly on the
>> output of the Mahout's seq2sparse (I actually tested that on reuters
>> dataset example from Mahout In Action).
>>
>> The computation has a stochastic noise to it, but LSI problems are
>> never exact problems anyway and their result are subject to corpus
>> selection.
>>
>> If you are interested to help me to work kinks out in Mahout's
>> version, I'd be grateful since i don't run the method on 0.20.2 but on
>> CDH in a customized way. But be warned that it may require a number of
>> patches before it works for you.
>>
>> Here is (a little bit too wordy) command line manual for Mahout 0.5.
>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html
>>
>> Thanks.
>>
>> -D
>>
>>
>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>>> Yeha :) That's what I assumed. So this problem is solved. Thanks!
>>>
>>> And now your questions:
>>>
>>> I want to do LSA/LSI with Mahout. Therefore I take the TDIDF
>>> Document-Term-Matrix and reduce it using Lanczos. So I get the
>>> Document-Concept-Vectors. These Vectors will be compared with cosine
>>> similarity to find similar documents. I can also cluster these
>>> documents with k-means...
>>>
>>> If you have any suggestions, feel free to tell me. I'm also interested
>>> in other document-similarity-techniques.
>>>
>>> Cheers,
>>> Stefan
>>>
>>> 2011/6/6 Danny Bickson <danny.bickson@gmail.com>:
>>>> If I understand your question correctly, you need to simply transpose M
>>>> before you start the run and that way
>>>> you will get the other singular vectors.
>>>>
>>>> May I ask what is the problem you are working on and why do you need the
>>>> singular vectors?
>>>> Can you consider using another matrix decomposition technique for example
>>>> alternating least squares
>>>> which gives you two lower rank matrices which simulates the large decomposed
>>>> matrix?
>>>>
>>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <stefan@wienert.cc>
wrote:
>>>>
>>>>> Hi Danny!
>>>>>
>>>>> I understand that for M*M' (and for M'*M) the left and right
>>>>> eigenvectors are identical. But that is not exactly what I want. The
>>>>> lanczos solver from mahout gives me the eigenvectors of M*M', which
>>>>> are the left singular vectors of M. But I need the right singular
>>>>> vectors of M (and not M*M'). How do I get them?
>>>>>
>>>>> Sorry, my matrix math is not as good as it should be, but I hope you
>>>>> can help me!
>>>>>
>>>>> Thanks,
>>>>> Stefan
>>>>>
>>>>> 2011/6/6 Danny Bickson <danny.bickson@gmail.com>:
>>>>> > Hi Stefan!
>>>>> > For a positive semidefinite matrix, the lest and right eigenvectors
are
>>>>> > identical.
>>>>> > See SVD wikipeida text: When *M* is also positive
>>>>> > semi-definite<http://en.wikipedia.org/wiki/Positive-definite_matrix>,
>>>>> > the decomposition *M* = *U**D**U* * is also a singular value
>>>>> decomposition.
>>>>> > So you don't need to be worried about the other singular vectors.
>>>>> >
>>>>> > Hope this helps!
>>>>> >
>>>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <stefan@wienert.cc>
>>>>> wrote:
>>>>> >
>>>>> >> Hi.
>>>>> >>
>>>>> >> Thanks for the help.
>>>>> >>
>>>>> >> The important points from wikipedia are:
>>>>> >> - The left singular vectors of M are eigenvectors of M*M' .
>>>>> >> - The right singular vectors of M are eigenvectors of M'*M.
>>>>> >>
>>>>> >> as you describe, the mahout lanczos solver calculate A=M'*M
(I think
>>>>> >> it does A=M*M', but it is not a problem). Therefore it does
already
>>>>> >> calculate the right (or left) singular vector of M.
>>>>> >>
>>>>> >> But my question is, how can I get the other singular vector?
I can
>>>>> >> transpose M, but then I have to calculated two SVDs, one for
the right
>>>>> >> and one for the left singular value... I think there is a better
way
>>>>> >> :)
>>>>> >>
>>>>> >> Hope you can help me with this...
>>>>> >> Thanks
>>>>> >> Stefan
>>>>> >>
>>>>> >>
>>>>> >> 2011/6/6 Danny Bickson <danny.bickson@gmail.com>:
>>>>> >> > Hi
>>>>> >> > Mahout SVD implementation computes the Lanzcos iteration:
>>>>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>>>>> >> > Denote the non-square input matrix as M. First a symmetric
matrix A is
>>>>> >> > computed by A=M'*M
>>>>> >> > Then an approximating tridiagonal matrix T and a vector
matrix V are
>>>>> >> > computed such that A =~ V*T*V'
>>>>> >> > (this process is done in a distributed way).
>>>>> >> >
>>>>> >> > Next the matrix T is next decomposed into eigenvectors
and
>>>>> eignevalues.
>>>>> >> > Which is the returned result. (This process
>>>>> >> > is serial).
>>>>> >> >
>>>>> >> > The third step makes the returned eigenvectors orthogonal
to each
>>>>> other
>>>>> >> > (which is optional IMHO).
>>>>> >> >
>>>>> >> > The heart of the code is found at:
>>>>> >> >
>>>>> >>
>>>>> ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>>>>> >> > At least that is where it was in version 0.4 I am not sure
if there
>>>>> are
>>>>> >> > changes in version 0.5
>>>>> >> >
>>>>> >> > Anyway, Mahout does not compute directly SVD. If you are
interested in
>>>>> >> > learning more about the relation to SVD
>>>>> >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
>>>>> >> > subsection: relation to eigenvalue decomposition.
>>>>> >> >
>>>>> >> > Hope this helps,
>>>>> >> >
>>>>> >> > Danny Bickson
>>>>> >> >
>>>>> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <stefan@wienert.cc>
>>>>> >> wrote:
>>>>> >> >
>>>>> >> >> After reading this thread:
>>>>> >> >>
>>>>> >> >>
>>>>> >>
>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E
>>>>> >> >>
>>>>> >> >> Wiki-SVD: M = U S V* (* = transposed)
>>>>> >> >>
>>>>> >> >> The output of Mahout-SVD is (U S) right?
>>>>> >> >>
>>>>> >> >> So... How do I get V from (U S)  and M?
>>>>> >> >>
>>>>> >> >> Is V = M (U S)* (because this is, what the calculation
in the example
>>>>> >> is)?
>>>>> >> >>
>>>>> >> >> Thanks
>>>>> >> >> Stefan
>>>>> >> >>
>>>>> >> >> 2011/6/6 Stefan Wienert <stefan@wienert.cc>:
>>>>> >> >> >
>>>>> >>
>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>> >> >> >
>>>>> >> >> > What is done:
>>>>> >> >> >
>>>>> >> >> > Input:
>>>>> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
>>>>> >> >> >
>>>>> >> >> > "SVD" of tf-idf-matrix (rank 100) produces the
eigenvector (and
>>>>> >> >> > eigenvalues) of tf-idf-matrix, called:
>>>>> >> >> > svd (concepts x terms) 87 x 20444
>>>>> >> >> >
>>>>> >> >> > transpose tf-idf-matrix:
>>>>> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x
6076937
>>>>> >> >> >
>>>>> >> >> > transpose svd:
>>>>> >> >> > svd-transpose (terms x concepts) 20444 x 87
>>>>> >> >> >
>>>>> >> >> > matrix multiply:
>>>>> >> >> > tf-idf-matrix-transpose x svd-transpose = result
>>>>> >> >> > (terms x docs) x (terms x concepts) = (docs x
concepts)
>>>>> >> >> >
>>>>> >> >> > so... I do understand, that the "svd" here is
not SVD from
>>>>> wikipedia.
>>>>> >> >> > It only does the Lanczos algorithm and some magic
which produces
>>>>> the
>>>>> >> >> >> Instead either the left or right (but usually
the right)
>>>>> eigenvectors
>>>>> >> >> premultiplied by the diagonal or the square root of
the
>>>>> >> >> >> diagonal element.
>>>>> >> >> > from
>>>>> >> >>
>>>>> >>
>>>>> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=Rta7tfRm8Zi60VcFya5xF+dbFrJ8pcds2N0-V@mail.gmail.com%3E
>>>>> >> >> >
>>>>> >> >> > so my question: what is the output of the SVD
in mahout. And what
>>>>> do I
>>>>> >> >> > have to calculate to get the "right singular value"
from svd?
>>>>> >> >> >
>>>>> >> >> > Thanks,
>>>>> >> >> > Stefan
>>>>> >> >> >
>>>>> >> >> > 2011/6/6 Stefan Wienert <stefan@wienert.cc>:
>>>>> >> >> >>
>>>>> >> >>
>>>>> >>
>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>> >> >> >>
>>>>> >> >> >> the last step is the matrix multiplication:
>>>>> >> >> >>  --arg --numRowsA --arg 20444 \
>>>>> >> >> >>  --arg --numColsA --arg 6076937 \
>>>>> >> >> >>  --arg --numRowsB --arg 20444 \
>>>>> >> >> >>  --arg --numColsB --arg 87 \
>>>>> >> >> >> so the result is a 6,076,937 x 87 matrix
>>>>> >> >> >>
>>>>> >> >> >> the input has 6,076,937 (each with 20,444
terms). so the result of
>>>>> >> >> >> matrix multiplication has to be the right
singular value regarding
>>>>> to
>>>>> >> >> >> the dimensions.
>>>>> >> >> >>
>>>>> >> >> >> so the result is the "concept-document vector
matrix" (as I think,
>>>>> >> >> >> these is also called "document vectors" ?)
>>>>> >> >> >>
>>>>> >> >> >> 2011/6/6 Ted Dunning <ted.dunning@gmail.com>:
>>>>> >> >> >>> Yes.  These are term vectors, not document
vectors.
>>>>> >> >> >>>
>>>>> >> >> >>> There is an additional step that can be
run to produce document
>>>>> >> >> vectors.
>>>>> >> >> >>>
>>>>> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan
Wienert <stefan@wienert.cc
>>>>> >
>>>>> >> >> wrote:
>>>>> >> >> >>>
>>>>> >> >> >>>> compared to SVD, is the result is
the "right singular value"?
>>>>> >> >> >>>>
>>>>> >> >> >>>
>>>>> >> >> >>
>>>>> >> >> >>
>>>>> >> >> >>
>>>>> >> >> >> --
>>>>> >> >> >> Stefan Wienert
>>>>> >> >> >>
>>>>> >> >> >> http://www.wienert.cc
>>>>> >> >> >> stefan@wienert.cc
>>>>> >> >> >>
>>>>> >> >> >> Telefon: +495251-2026838
>>>>> >> >> >> Mobil: +49176-40170270
>>>>> >> >> >>
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> > --
>>>>> >> >> > Stefan Wienert
>>>>> >> >> >
>>>>> >> >> > http://www.wienert.cc
>>>>> >> >> > stefan@wienert.cc
>>>>> >> >> >
>>>>> >> >> > Telefon: +495251-2026838
>>>>> >> >> > Mobil: +49176-40170270
>>>>> >> >> >
>>>>> >> >>
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> --
>>>>> >> >> Stefan Wienert
>>>>> >> >>
>>>>> >> >> http://www.wienert.cc
>>>>> >> >> stefan@wienert.cc
>>>>> >> >>
>>>>> >> >> Telefon: +495251-2026838
>>>>> >> >> Mobil: +49176-40170270
>>>>> >> >>
>>>>> >> >
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Stefan Wienert
>>>>> >>
>>>>> >> http://www.wienert.cc
>>>>> >> stefan@wienert.cc
>>>>> >>
>>>>> >> Telefon: +495251-2026838
>>>>> >> Mobil: +49176-40170270
>>>>> >>
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Stefan Wienert
>>>>>
>>>>> http://www.wienert.cc
>>>>> stefan@wienert.cc
>>>>>
>>>>> Telefon: +495251-2026838
>>>>> Mobil: +49176-40170270
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Stefan Wienert
>>>
>>> http://www.wienert.cc
>>> stefan@wienert.cc
>>>
>>> Telefon: +495251-2026838
>>> Mobil: +49176-40170270
>>>
>>
>
>
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> stefan@wienert.cc
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
>

Mime
View raw message