mahout-user mailing list archives

From Dmitriy Lyubimov <>
Subject Re: need help on mahout
Date Fri, 09 Nov 2012 21:45:35 GMT
Another correction: CVB = *collapsed* variational Bayes.

On Fri, Nov 9, 2012 at 1:37 PM, Dmitriy Lyubimov <> wrote:

> correction, with LSA you probably want to use rows of U or U*sqrt(Sigma)
> (ssvd --uHalfSigma option), not U*Sigma.
> On Fri, Nov 9, 2012 at 1:34 PM, Dmitriy Lyubimov <> wrote:
>> No, SGD (stochastic gradient descent) and factorization are two different
>> things. More strictly, they address two different classes of problems:
>> factorization and regression. SGD is one way to fit regression and
>> classification models. Factorization finds latent factors in a user/item
>> space (ALS-WR is one of the methods to find such factors).
>> Yes, SGD is in the book, but not with your example specifically, since I
>> meant for you to apply it after you find latent variables (factors,
>> whatever).
>> You will get more help on the ALS-WR method by staying on the list, and
>> you may also create an archive entry for others to follow in a similar
>> situation. The idea is that we all learn together and effectively :) (and I
>> score more points for support :)
>> CVB (if I am not totally off) is something called continuous variational
>> Bayes, an implementation of LDA (Latent Dirichlet Allocation), which may
>> help you analyze the content of your web pages IF you manage to grab the
>> text off of them. In Mahout, it is facilitated by a package here:
>> I don't know where exactly the wiki help on CVB is, but searching the
>> Mahout archive and Stack Overflow may help. Again, by staying on the list
>> you may get more help with that.
>> LSA (latent semantic analysis) is another way to analyze the content of
>> your web pages. See the Wikipedia article for a refresher, but basically it
>> is a run of SVD over tf-idf of unigrams, bigrams, etc. Mahout has a general
>> pipeline to prepare that context data with the seqdirectory and seq2sparse
>> commands (again, you can find details in the book). Then you just run
>> 'mahout ssvd <options>' on the output of seq2sparse and use rows of the
>> U*Sigma output for the topical allocation values. Somebody will probably
>> correct me on this, but I think you can use the topical allocation values
>> to further build your classification with regressions (SGD).
>> -d
>> On Fri, Nov 9, 2012 at 1:11 PM, qiaoresearcher <> wrote:
>>> Hi Dmitriy,
>>> Many thanks for your comments; I really appreciate them, although I think
>>> I may not have fully understood you.
>>> As I understand it, SGD means stochastic gradient descent, is that right?
>>> What I need now is some example code to read the files, construct the
>>> web page set, then form the vectors. Such steps are called 'factorization'
>>> in Mahout, right?
>>> Do you mean Mahout in Action has examples similar to what I described?
>>> What are CVB, LSA, and SSVD (singular value decomposition?)
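
The "form the vectors" step discussed in this thread (tf-idf over unigrams and bigrams, the input that LSA runs SVD on) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not what Mahout's seq2sparse does internally: the whitespace tokenizer, the tf * ln(N/df) weighting, and the names ngrams, tfidf and pages are all assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, max_n=2):
    """Build one sparse tf-idf vector (a dict) per document.

    Terms are all n-grams for n = 1..max_n. The weight is
    tf * ln(N / df), an assumed textbook variant; real tools
    often add smoothing or normalization on top of this.
    """
    counts = []
    for text in docs:
        tokens = text.lower().split()  # naive whitespace tokenizer
        terms = []
        for n in range(1, max_n + 1):
            terms.extend(ngrams(tokens, n))
        counts.append(Counter(terms))

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    n_docs = len(docs)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]


# Hypothetical stand-ins for text scraped from web pages.
pages = ["mahout runs on hadoop", "mahout has ssvd", "hadoop runs jobs"]
vectors = tfidf(pages)
```

Each entry of vectors is a sparse row of the term-document matrix. In the pipeline described above, such rows would be the input to the SVD step, and the resulting per-document factors could then feed a regression or classifier trained with SGD.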
