mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Labeled LDA
Date Wed, 19 Sep 2012 20:03:22 GMT
Labeled LDA does not exist in Mahout *as published* in 0.7, but it's got a
close variant in the
fork on Github <https://github.com/twitter/mahout> which Twitter has been
working with:

  In 0.7, we allow training to specify any seed model (i.e. a matrix of
latent topic to term counts)
which it uses to start with (if you don't specify one, it starts random,
but you are welcome to build
up your own matrix of "informed priors" on term distributions for each
topic).  This doesn't get you
anything like L-LDA, but on the Github fork, we also allow you to specify
priors on the
document/topic distribution: you take your set of input documents, and if
each one has some known
set of labels associated with it, you set the prior on p(topic) for that
document to be not
random (or uniform across all topics) but uniform across its known labels.
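
  A minimal sketch of such a label-based prior, in plain Java with no Mahout
dependencies. The class and method names are hypothetical (they are not part
of Mahout or the fork), and it assumes each label has already been mapped to
a distinct topic index:

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout: builds a per-document topic prior.
final class LabelPriorSketch {

    // Returns p(topic) for one document: uniform over its label topics,
    // or uniform over all topics when the document has no labels.
    // Assumes labelTopicIds holds distinct indices in [0, numTopics).
    static double[] uniformOverLabels(int numTopics, int[] labelTopicIds) {
        double[] prior = new double[numTopics];
        if (labelTopicIds.length == 0) {
            Arrays.fill(prior, 1.0 / numTopics);
            return prior;
        }
        double mass = 1.0 / labelTopicIds.length;
        for (int topic : labelTopicIds) {
            prior[topic] = mass;
        }
        return prior;
    }

    public static void main(String[] args) {
        // A document labeled with topics 1 and 3, out of 4 topics total.
        System.out.println(Arrays.toString(uniformOverLabels(4, new int[] {1, 3})));
    }
}
```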

  Labeled LDA further constrains that when you do training, you force
p(topic | doc_i) = 0 for
all topics outside of the label set for doc_i, which we don't implement
currently (even on
the Github fork), although it would be easy enough to implement.  We allow
the document
distributions to drift freely after the initial prior is applied, which
leads to something like
an intermediate algorithm between regular LDA and L-LDA.

  To get "true" L-LDA, the code you'd want to modify is in
here <https://github.com/twitter/mahout/blob/master/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0PriorMapper.java>.
Before train() is called (line 108),
you'd want to keep a copy of the docTopicPrior vector, keeping note of
which topics had zero
probability, and then before the final line in the map() method, you'd want
to zero-out the entries
in the updated docTopicPrior vector that should be zero and renormalize it
before emitting.
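
  The zero-out-and-renormalize step could look roughly like the sketch
below. It uses plain double arrays rather than Mahout's vector classes, and
the class and method names are hypothetical; the actual change would have to
be adapted to the mapper's own types:

```java
import java.util.Arrays;

// Hypothetical sketch, not Mahout code: enforce the L-LDA constraint
// p(topic | doc) = 0 for topics outside the document's label set.
final class LabelMaskSketch {

    // Record which topics had nonzero probability in the original prior.
    static boolean[] allowedTopics(double[] docTopicPrior) {
        boolean[] allowed = new boolean[docTopicPrior.length];
        for (int k = 0; k < docTopicPrior.length; k++) {
            allowed[k] = docTopicPrior[k] > 0.0;
        }
        return allowed;
    }

    // Zero the disallowed entries of the updated distribution and rescale
    // the rest so they sum to 1.
    static double[] maskAndRenormalize(double[] updated, boolean[] allowed) {
        double[] out = new double[updated.length];
        double sum = 0.0;
        int numAllowed = 0;
        for (int k = 0; k < updated.length; k++) {
            if (allowed[k]) {
                out[k] = updated[k];
                sum += updated[k];
                numAllowed++;
            }
        }
        if (numAllowed == 0) {
            return updated.clone(); // no constraint recorded; leave as-is
        }
        for (int k = 0; k < out.length; k++) {
            if (allowed[k]) {
                // If all allowed mass vanished, fall back to uniform over labels.
                out[k] = sum > 0.0 ? out[k] / sum : 1.0 / numAllowed;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] prior = {0.0, 0.5, 0.0, 0.5};   // labels: topics 1 and 3
        boolean[] allowed = allowedTopics(prior); // copy taken before training
        double[] updated = {0.1, 0.3, 0.2, 0.4};  // distribution after drifting
        System.out.println(Arrays.toString(maskAndRenormalize(updated, allowed)));
    }
}
```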

  If you want to try this out, please let me know how it goes, and I'd be
happy to accept your
pull request to add this! :)


On Wed, Sep 19, 2012 at 7:42 AM, Salman Mahmood <salman@influestor.com> wrote:

> Oh, and L-LDA is not implemented in Mahout. At least not in the 0.7 release.
> It would be nice if it were available in future releases.
> On Sep 19, 2012, at 3:28 PM, Andrea Di Menna wrote:
>
> > Hello,
> >
> > I found somewhere in the mailing archives (actually here
> > http://www.mail-archive.com/user@mahout.apache.org/msg07138.html) that
> Jake
> > Mannix was planning to work on L-LDA for Mahout.
> > But I don't seem to find anything in the source code (I may be looking in
> > the wrong direction though...).
> >
> > Any help?
> >
> > Cheers
> > Andrea


-- 

  -jake
