mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: need help on mahout
Date Fri, 09 Nov 2012 23:43:01 GMT
Additional confusion typically arises because supervised and unsupervised
methods are commonly used together.  For instance, clustering
(unsupervised) can be used to generate cluster proximity features that are
then used as features for classification (supervised).
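
As a rough illustration of the first case, here is a minimal sketch in plain
Python (not Mahout code).  The toy points, the "first k points" centroid
initialization, and the function names are all assumptions made up for this
example; a real pipeline would use a proper clustering implementation.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Very small k-means; labels are never looked at (unsupervised)."""
    # Assumption for illustration: seed centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            groups[nearest].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def proximity_features(point, centroids):
    """One feature per cluster: the distance from the point to its centroid."""
    return [math.sqrt(squared_distance(point, c)) for c in centroids]


# Toy data: two obvious groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centroids = kmeans(points, k=2)

# These rows are what you would hand to a supervised classifier,
# either alone or appended to the original features.
features = [proximity_features(p, centroids) for p in points]
```

Each row of `features` has one distance per cluster, so a downstream
classifier sees "how close is this point to each cluster" rather than, or in
addition to, the raw coordinates.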

Another example is to run unsupervised clustering on the labeled data,
including the target variable along with the other features.  The
algorithm itself is unsupervised, but it is used in such a way that it can
see the target variable, so it ends up doing a strange sort of mixed
thing.  The resulting cluster proximity features can be very high quality.

You can even do semi-supervised clustering with training data that is only
partially labeled.

It isn't surprising that these distinctions are a bit fuzzy at first.

On Fri, Nov 9, 2012 at 2:11 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:

> The confusion here may be over the term "supervised".
>
> Supervised classification assumes you know which group each user is in,
> and the classifier builds a model to classify new users into the predefined
> groups. Usually there is a classifier for each group that, when given a
> user vector, returns how likely the user is to be a member of that group.
>
> Clustering is an unsupervised method that finds the groups without
> being told which user is in which group. It does this by finding structure
> in the data itself.
>
> If you don't know the groups ahead of time, you want to cluster. If you are
> classifying users based on known groups of previous users, you want to build
> a classifier. Mahout has both.
>
> You probably need to create the vectors using Mahout code. Your matrix of
> users and pages visited could be very large and sparse (lots of pages not
> visited), so representing it as a .csv is not scalable. Look at the various
> Vector classes in Mahout. Once you get the data into vectors, Mahout can
> cluster the data or build a supervised classifier.
>
> There is a very nice description of the Mahout Vector types, clustering,
> and classification in "Mahout in Action", a book from Manning Publications.
> Read section 8.1.1, "Transforming data into vectors". The rest of the
> chapter talks about clustering, and sections further along cover
> classification.
>
> On Nov 9, 2012, at 1:44 PM, qiaoresearcher <qiaoresearcher@gmail.com>
> wrote:
>
> Many thanks, I may need some time to digest the information you
> provided... :-)
>
> have a nice weekend,
>
>
> On Fri, Nov 9, 2012 at 3:34 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>
> > No, SGD (stochastic gradient descent) and factorization are two different
> > things. More strictly, those are two different classes of problems --
> > factorization and regression. SGD is one implementation for regression
> > classification. Factorization is finding virtual factors in a user/item
> > space (ALS-WR is one of the methods to find such factors).
> >
> > Yes, SGD is in the book, but not with your example specifically, since I
> > meant to apply it after you find latent variables (factors, whatever).
> >
> > You will get more help on the ALS-WR method by staying on the list, and
> > you will perhaps also create an archive entry for others in a similar
> > situation to follow. The idea is that we all learn together and
> > effectively :) (and I score more points for support :)
> >
> > CVB (if I am not totally off) is the collapsed variational Bayes
> > implementation of LDA (Latent Dirichlet Allocation), which may help
> > you analyze the content of your web pages IF you manage to grab the text
> > off of them. In Mahout, it is provided by this package:
> >
> > https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/clustering/lda/cvb/package-summary.html
> >
> > I don't know exactly where the wiki help on CVB is, but searching the
> > Mahout archive and Stack Overflow may help. Again, by staying on the list
> > you may get more help with that.
> >
> > LSA (latent semantic analysis) is another way to analyze the content of
> > your web pages. See the Wikipedia article for a refresher, but basically
> > it is a run of SVD over tf-idf of unigrams, bigrams, etc. Mahout has a
> > general pipeline to prepare that context data with the seqdirectory and
> > seq2sparse commands (again, you can find details in the book). Then you
> > just run 'mahout ssvd <options>' on the output of seq2sparse and use the
> > rows of the U*Sigma output as the topical allocation values. Somebody will
> > probably correct me on this, but I think you can use the topical
> > allocation values to further build your classification with regressions
> > (SGD).
> >
> > -d
> >
> >
> > On Fri, Nov 9, 2012 at 1:11 PM, qiaoresearcher <qiaoresearcher@gmail.com>
> > wrote:
> >
> >> Hi Dmitriy,
> >>
> >> Many thanks for your comments, which I really appreciate, although I
> >> think I may not have fully understood you.
> >>
> >> As I understand it, SGD means stochastic gradient descent, is that right?
> >> What I need now is some example code to: read the files, construct the
> >> web page set, then form the vectors. Such steps are called
> >> 'factorization' in Mahout, right?
> >>
> >> Do you mean Mahout in Action has examples similar to what I described?
> >> What are CVB, LSA, and SSVD (singular value decomposition?)
> >>
> >>
> >>
> >
>
>
