mahout-user mailing list archives

From Charles Earl <charlesce...@me.com>
Subject Re: LDATopic
Date Thu, 01 Dec 2011 04:28:49 GMT
Jake,
Thanks for the pending update.
Slightly off topic: if I understand your notes on MAHOUT-897, Gibbs sampling would only be
feasible in MR implementations that support efficient iteration (Spark, perhaps YARN), but
not in Mahout as currently conceived. In the case of Spark, the RDD is the shared memory
that enables faster synchronization across samplers. The need for synchronization across local
samplers may mean that Gibbs sampling is better suited to OpenMP.
The approach in MAHOUT-897 is understandably similar to http://arxiv.org/pdf/1107.3765 (Using
Variational Inference and MapReduce to Scale Topic Modeling).
Do you have any recommendations for a topic update scheme that might work well (close to
real time) in practice?
For example, Yao's http://www.cs.umass.edu/~lmyao/papers/fast-topic-model10.pdf suggests simple
heuristics for identifying novel topics and memory-efficient streaming updates based on SparseLDA.
I would expect that something based on SparseLDA would be efficient for online updates.
Charles
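
To make the synchronization point concrete, here is a minimal sketch of one collapsed Gibbs
sweep for LDA in plain Python. It is not Mahout code and not the MAHOUT-897 approach; the names
(init_state, gibbs_sweep, n_dk, n_kw, n_k) and the symmetric alpha/beta priors are assumptions
for illustration. The topic-word counts n_kw and the topic totals n_k are read and written for
every token, and that is the state parallel samplers would have to keep in sync.

```python
import random


def init_state(docs, num_topics, vocab_size, rng):
    """Assign a random topic to every token and build the count tables."""
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    n_dk = [[0] * num_topics for _ in docs]               # per-document topic counts (local)
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts (global)
    n_k = [0] * num_topics                                # tokens per topic (global)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    return z, n_dk, n_kw, n_k


def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, vocab_size, rng):
    """One pass over all tokens; mutates z and the count tables in place."""
    num_topics = len(n_k)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Remove this token's current assignment from all counts.
            n_dk[d][k_old] -= 1
            n_kw[k_old][w] -= 1
            n_k[k_old] -= 1
            # Unnormalized conditional probability of each topic for this token.
            weights = [
                (n_dk[d][k] + alpha) * (n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            k_new = rng.choices(range(num_topics), weights=weights)[0]
            # Record the new assignment; n_kw and n_k are the shared state.
            z[d][i] = k_new
            n_dk[d][k_new] += 1
            n_kw[k_new][w] += 1
            n_k[k_new] += 1
```

If several such sweeps run on different document shards, each shard works from its own copy of
n_kw and n_k until the copies are merged, which is where cheap iteration or shared memory helps.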


On Nov 30, 2011, at 4:14 PM, Jake Mannix wrote:

> On Wed, Nov 30, 2011 at 1:03 PM, Isabel Drost <isabel@apache.org> wrote:
> 
>> On 28.11.2011 bish maten wrote:
>>> mahout ldatopics -i mahout-work/abc/abc-lda/state-20  -d
>>> mahout-work/abc/abc-out-seqdir-sparse-lda/dictionary.file-0  -dt
>>> sequencefile  (there were no errors reported and command worked fine with
>>> following output). Does the output appear ok?
>> 
>> Hmm - this only prints the resulting LDA topics - which command did you use to
>> generate them?
>> 
>> Please also note that Jake is currently working on improving our LDA support, if
>> you are interested in that algorithm it might be interesting for you to look
>> into his patch in https://issues.apache.org/jira/browse/MAHOUT-897
> 
> 
> Yeah, I'm also working on moving away from LDATopic altogether, instead using
> VectorDumper + dictionary file and grabbing top N weighted elements in the vector
> representing the topic.  We already do this internally at Twitter, I just have to get
> that particular patch formatted properly and cleaned up once MAHOUT-897 gets
> committed (which will hopefully be this week).
> 
>  -jake
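
The top-N step described in the quoted message can be sketched as follows. This is a plain
Python illustration, not VectorDumper itself: the topic vector and the dictionary are assumed to
be already loaded into dicts keyed by term id, and top_terms is a hypothetical helper name.

```python
import heapq


def top_terms(topic_weights, dictionary, n=10):
    """Return the n highest-weighted (term, weight) pairs for one topic.

    topic_weights: {term_id: weight}, the vector representing the topic (assumed layout)
    dictionary:    {term_id: term string}, as read from a dictionary file (assumed layout)
    """
    best = heapq.nlargest(n, topic_weights.items(), key=lambda kv: kv[1])
    # Fall back to a placeholder if a term id is missing from the dictionary.
    return [(dictionary.get(term_id, "<unk:%d>" % term_id), w) for term_id, w in best]


dictionary = {0: "apple", 1: "banana", 2: "cherry"}
topic = {0: 0.1, 1: 0.7, 2: 0.2}
print(top_terms(topic, dictionary, n=2))  # [('banana', 0.7), ('cherry', 0.2)]
```

heapq.nlargest avoids sorting the whole vocabulary when only a few terms per topic are needed.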

