mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: need help on mahout
Date Fri, 09 Nov 2012 21:34:59 GMT
No, SGD (stochastic gradient descent) and factorization are two different
things. More strictly, they belong to two different classes of problems:
regression and factorization. SGD is one way to fit regression and
classification models. Factorization is finding latent ("virtual") factors
in a user/item space (ALS-WR is one of the methods to find such factors).
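
To make the factorization side concrete, here is a minimal sketch of alternating least squares with a single latent factor per user and item. It is not Mahout's ALS-WR code: the function name `als_rank1`, the toy ratings, and the parameter values are all made up for illustration, and the only thing borrowed from ALS-WR is the idea of scaling the regularization term by the number of observed ratings.

```python
def als_rank1(ratings, n_users, n_items, lam=0.05, iters=20):
    """Rank-1 alternating least squares over (user, item, value) triples.

    Hypothetical illustration only. The regularization term is scaled by
    the number of observations per user/item, as in weighted-lambda ALS.
    """
    u = [1.0] * n_users  # one latent factor per user
    v = [1.0] * n_items  # one latent factor per item
    for _ in range(iters):
        # Hold item factors fixed; each user factor has a closed-form solution.
        for i in range(n_users):
            obs = [(j, r) for (a, j, r) in ratings if a == i]
            if obs:
                num = sum(r * v[j] for j, r in obs)
                den = sum(v[j] ** 2 for j, _ in obs) + lam * len(obs)
                u[i] = num / den
        # Hold user factors fixed; solve each item factor the same way.
        for j in range(n_items):
            obs = [(i, r) for (i, b, r) in ratings if b == j]
            if obs:
                num = sum(r * u[i] for i, r in obs)
                den = sum(u[i] ** 2 for i, _ in obs) + lam * len(obs)
                v[j] = num / den
    return u, v


# Made-up ratings; the (user 1, item 1) cell is deliberately left unobserved.
ratings = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 2.0), (1, 2, 6.0)]
u, v = als_rank1(ratings, n_users=2, n_items=3)
predicted_missing = u[1] * v[1]  # estimate for the unobserved cell
```

The product `u[i] * v[j]` is the model's estimate for any cell, including ones that were never observed, and the factors themselves are the latent variables a later regression step could consume.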

Yes, SGD is in the book, but not with your example specifically, since I
meant for you to apply it after you find the latent variables (factors,
whatever).
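
As a rough sketch of what "apply SGD after you find latent variables" could look like, the snippet below fits a logistic regression with plain stochastic gradient descent on small feature vectors that stand in for latent factors. The names `sgd_logistic` and `predict`, the learning rate, and the toy data are assumptions for illustration; this is not Mahout's SGD classifier API.

```python
import math
import random


def sgd_logistic(rows, labels, epochs=200, lr=0.5, seed=0):
    """Fit logistic-regression weights with plain stochastic gradient descent."""
    rng = random.Random(seed)
    weights = [0.0] * (len(rows[0]) + 1)  # last slot is the bias term
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            x = rows[i] + [1.0]  # append constant 1.0 for the bias
            z = sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = labels[i] - p  # gradient of the log-likelihood w.r.t. z
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    return weights


def predict(weights, row):
    """Return the predicted probability of class 1 for one feature vector."""
    z = sum(w * xi for w, xi in zip(weights, row + [1.0]))
    return 1.0 / (1.0 + math.exp(-z))


# Made-up data: two latent factors per page, with a binary label.
rows = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
labels = [1, 1, 0, 0]
weights = sgd_logistic(rows, labels)
```

Each update looks at one example at a time, which is what makes the method stochastic and lets it scale to data that does not fit in memory.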

You will get more help on the ALS-WR method by staying on the list, and
perhaps also create an archive entry for others to follow in a similar
situation. The idea is that we all learn together and effectively :) (and I
score more points for support :)

CVB (if I am not totally off) is collapsed variational Bayes, an
implementation of LDA (Latent Dirichlet Allocation), which may help
you to analyze the content of your web pages IF you manage to grab the text
off of them. In Mahout, it is provided by this package:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/clustering/lda/cvb/package-summary.html
I don't know where exactly the wiki help on CVB is, but searching the Mahout
archives and Stack Overflow may help. Again, by staying on the list you may
get more help with that.

LSA (Latent Semantic Analysis) is another way to analyze the content of your
web pages. See the Wikipedia article for a refresher, but basically it is a
run of SVD over the tf-idf of unigrams, bigrams etc. Mahout has a general
pipeline to prepare that content data with the seqdirectory and seq2sparse
commands (again, you can find details in the book). Then you just run
'mahout ssvd <options>' on the output of seq2sparse and use the rows of the
U*Sigma output as the topical allocation values. Somebody will probably
correct me on this, but I think you can use the topical allocation values to
further build your classification with regressions (SGD).
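
For intuition about what the rows of U*Sigma are, here is a small self-contained sketch of LSA on a toy corpus. It makes several assumptions that differ from the Mahout pipeline: it uses unigrams only, a textbook tf * log(N/df) weighting, and plain power iteration with deflation instead of the stochastic SVD that 'mahout ssvd' runs. All function names and documents are made up.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build a dense document-by-term matrix of tf * log(N / df) weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, rows


def top_singular(matrix, k, iters=200):
    """Approximate the top-k singular triplets by power iteration with deflation."""
    a = [row[:] for row in matrix]  # work on a copy; deflation modifies it
    n_rows, n_cols = len(a), len(a[0])
    triplets = []
    for c in range(k):
        # Deterministic, non-uniform start vector.
        v = [1.0 / (1 + (j + c) % 3) for j in range(n_cols)]
        for _ in range(iters):
            av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
            w = [sum(a[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]  # v converges to a right singular vector
        av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        sigma = math.sqrt(sum(x * x for x in av))
        u = [x / sigma for x in av] if sigma > 0.0 else [0.0] * n_rows
        triplets.append((sigma, u, v))
        # Deflate: subtract the component just found before the next pass.
        for i in range(n_rows):
            for j in range(n_cols):
                a[i][j] -= sigma * u[i] * v[j]
    return triplets


def lsa_embed(docs, k=2):
    """Return one k-dimensional row of U*Sigma per document."""
    _, rows = tfidf_matrix(docs)
    triplets = top_singular(rows, k)
    return [[sigma * u[d] for sigma, u, _ in triplets] for d in range(len(docs))]


def distance(a, b):
    """Euclidean distance between two embedding rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up corpus: two pages about clusters, two about cooking.
docs = [
    "mahout hadoop cluster jobs",
    "hadoop cluster mahout nodes",
    "recipe pasta tomato basil",
    "pasta basil recipe garlic",
]
embeddings = lsa_embed(docs, k=2)
```

Pages on the same topic end up close together in the reduced space, and each row of `embeddings` is the kind of low-dimensional vector that could then be fed to a regression or classifier.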

-d


On Fri, Nov 9, 2012 at 1:11 PM, qiaoresearcher <qiaoresearcher@gmail.com> wrote:

> Hi Dmitriy,
>
> Many thanks for your comments. I really appreciate them, although I think
> I may not have fully understood you.
>
> As I understand it, SGD means stochastic gradient descent, is that right?
> What I need now is some example code to: read the files, construct the
> web page set, then form the vectors. Such steps are called 'factorization'
> in Mahout, right?
>
> Do you mean Mahout in Action has examples similar to what I described?
> What are CVB, LSA, and SSVD (singular value decomposition?)
>
>
>
