mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: need help on mahout
Date Fri, 09 Nov 2012 22:11:11 GMT
The confusion here may be over the term "supervised".

Supervised classification assumes you know which group each user is in, and the classifier
builds a model to classify new users into the predefined groups. Usually there is a classifier
for each group that, when given a user vector, returns how likely the user is a member of that
group.

Clustering is unsupervised classification: it finds the groups without being told which user
is in which group. It does this by finding structure in the data itself.

If you don't know the groups ahead of time, you want to cluster. If you are classifying users
based on known groups of previous users, you want to build a classifier. Mahout has both.
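
To make the distinction concrete, here is a toy sketch in plain Python (not Mahout code). The user vectors, the group names, and the helper functions are all made up for illustration, and I have swapped in a nearest-centroid classifier and a naive k-means as simple stand-ins for whatever classifier or clusterer you would actually pick in Mahout.

```python
# Toy illustration only: plain Python, not Mahout code.
# Each user is a vector of page-visit counts (hypothetical data).
from math import dist

users = [(5, 0), (4, 1), (0, 5), (1, 4)]

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(sum(c) / len(points) for c in zip(*points))

# Supervised: the group of each training user is known in advance.
labels = ["sports", "sports", "news", "news"]

def train(vectors, labels):
    """Build one centroid per known group."""
    groups = {}
    for v, label in zip(vectors, labels):
        groups.setdefault(label, []).append(v)
    return {label: centroid(vs) for label, vs in groups.items()}

def classify(model, v):
    """Assign a new user to the group with the nearest centroid."""
    return min(model, key=lambda label: dist(model[label], v))

# Unsupervised: no labels; k-means finds the groups from the data alone.
def kmeans(vectors, k, iterations=10):
    centers = list(vectors[:k])  # naive seeding: first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: dist(centers[i], v))
            clusters[nearest].append(v)
        # keep the old center if a cluster ends up empty
        centers = [centroid(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

model = train(users, labels)
predicted = classify(model, (6, 1))   # new user, classified into a known group
clusters = kmeans(users, k=2)         # groups discovered without any labels
```

The supervised half needs `labels` before it can do anything. The clustering half never sees them and only returns anonymous groups that you still have to interpret yourself.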

You probably need to create the vectors using Mahout code. Your matrix of users and pages
visited could be very large and sparse (lots of pages not visited), so representing it as a
.csv is not scalable. Look at the various Vector classes in Mahout. Once you get the data into
vectors, Mahout can cluster the data or build a supervised classifier.
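
As a rough picture of why sparse vectors matter, here is a plain-Python sketch (again, not the Mahout API). The visit log and the names `page_index`, `user_vectors`, and `to_dense` are hypothetical. In Mahout itself you would use one of its Vector classes, for example RandomAccessSparseVector, rather than a dict.

```python
# Toy sketch in plain Python, not the Mahout API.
from collections import defaultdict

# Hypothetical visit log: (user, page) pairs.
visits = [
    ("alice", "/home"), ("alice", "/sports"), ("alice", "/sports"),
    ("bob", "/news"),
]

# Map each distinct page to a column index, in order of first appearance.
page_index = {}
for _, page in visits:
    page_index.setdefault(page, len(page_index))

# One sparse vector per user: {column index: visit count}.
# Pages a user never visited take no storage at all.
user_vectors = defaultdict(dict)
for user, page in visits:
    col = page_index[page]
    user_vectors[user][col] = user_vectors[user].get(col, 0) + 1

def to_dense(sparse, size):
    """Expand a sparse vector to a dense list (what a .csv row would hold)."""
    return [sparse.get(i, 0) for i in range(size)]
```

With millions of pages, the dense row for each user would be almost entirely zeros, while the sparse form only stores the handful of pages that user actually visited.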

There is a very nice description of the Mahout Vector types, clustering, and classification
in "Mahout in Action", a book from Manning Publications. Read section 8.1.1, "Transforming data
into vectors". The rest of the chapter talks about clustering, and sections further along
cover classification.

On Nov 9, 2012, at 1:44 PM, qiaoresearcher <> wrote:

many thanks, i may need some time to digest the information you

have a nice weekend,

On Fri, Nov 9, 2012 at 3:34 PM, Dmitriy Lyubimov <> wrote:

> No, SGD (stochastic gradient descent) and factorization are two different
> things. More strictly, those are two different classes of problems --
> factorization and regression. SGD is one implementation for regression-based
> classification. Factorization is finding virtual factors in a user/item
> space (ALS-WR is one of the methods to find such factors).
> Yes, SGD is in the book, but not with your example specifically, since I meant
> to apply it after you find latent variables (factors, whatever).
> You will get more help on the ALS-WR method by staying on the list, and also
> perhaps create an archive entry for others to follow in a similar
> situation. The idea is that we all learn together and effectively :) (and i
> score more points for support :)
> CVB (if i am not totally off) is the collapsed variational
> Bayes implementation of LDA (Latent Dirichlet Allocation), which may help
> you to analyze the content of your web pages IF you manage to grab the text off
> of them. In Mahout, it is facilitated by a package here:
> I don't know where exactly the wiki help on CVB is, but searching the Mahout archive
> and Stack Overflow may help. Again, by staying on the list you may get more
> help with that.
> LSA (latent semantic analysis) is another way to analyze the content of your
> web pages. See the Wikipedia article for a refresher, but basically it is a run of
> SVD over tf-idf of unigrams, bigrams etc. Mahout has a general pipeline to
> prepare that context data with the seqdirectory and seq2sparse commands (again,
> you can find details in the book). Then you just run 'mahout ssvd
> <options>' on the output of seq2sparse and use rows of the U*Sigma output for
> the topical allocation values. Somebody will probably correct me on this,
> but I think you can use the topical allocation values to further build your
> classification with regressions (SGD).
> -d
> On Fri, Nov 9, 2012 at 1:11 PM, qiaoresearcher <> wrote:
>> Hi Dmitriy,
>> Many thanks for your comments, which I really appreciate, although I think I
>> may not have fully understood you.
>> As I understand it, SGD means stochastic gradient descent, is that right?
>> What I need now is some example code to: read the files, construct the
>> web page set, then form the vectors. Such steps are called 'factorization'
>> in Mahout, right?
>> Do you mean Mahout in Action has examples similar to what I described?
>> What are CVB, LSA, and SSVD (singular value decomposition?)
