mahout-user mailing list archives

From Stanley Xu <wenhao...@gmail.com>
Subject Re: Yahoo's LDA code
Date Fri, 10 Jun 2011 09:42:22 GMT
Yep Jake. I just went through the paper and the shell scripts that call the
algorithm very quickly. It is not a map-reduce implementation of LDA; it just
uses the hadoop dfs and uses the mappers to run a parallel program.

But I think the idea is very useful, especially for iterative machine
learning algorithms. In many algorithms we might have lots of iterations to
run, and each hadoop mapreduce job has a lot of overhead. Take the Gibbs
sampling in LDA: with one job per iteration, if a mapreduce job has 1 minute
of overhead for setup and cleanup, 15 hours will be spent on overhead alone,
which is not acceptable I guess.
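
To make the arithmetic concrete, here is a rough sketch. The 900 iterations
is not a measured number; it is only what a 1-minute per-job overhead implies
for 15 hours, and the function name is made up for illustration.

```python
def total_overhead_hours(iterations, overhead_minutes_per_job):
    """Time spent on job setup and cleanup when every iteration is its own job."""
    return iterations * overhead_minutes_per_job / 60.0


# Assumed figure: 900 iterations at 1 minute of overhead each.
print(total_overhead_hours(900, 1))  # 15.0

# If all iterations ran inside a single job, the overhead would be paid once.
print(total_overhead_hours(1, 1))
```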

But if we could take advantage of the communication layer in this
implementation, then for many iterative algorithms we could run just 1
mapreduce job and have all the iterations happen inside that single map
phase.

Our dev team is working on a parallelized L-BFGS logistic regression these
days. Every mapper reads only its local data, and the global weights are
updated once per map-reduce iteration. Normally it takes 30-50 iterations to
converge. If we could use an implementation similar to this LDA one, we could
eliminate at least the 1-2 hours of overhead.
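
A minimal sketch of that pattern, just to show the shape of it: each "mapper"
computes a gradient over its own shard, the gradients are summed, and the
global weights are updated once per iteration. This is not our actual code.
It uses plain gradient descent instead of L-BFGS to stay short, runs in a
single process, and all names and the learning rate are made up.

```python
import math


def local_gradient(weights, shard):
    """Mapper step: logistic-loss gradient over one local shard of (features, label)."""
    grad = [0.0] * len(weights)
    for features, label in shard:
        z = sum(w * x for w, x in zip(weights, features))
        pred = 1.0 / (1.0 + math.exp(-z))
        for i, x in enumerate(features):
            grad[i] += (pred - label) * x
    return grad


def global_step(weights, shards, learning_rate=0.1):
    """One iteration: gradient per shard, sum them (reduce), update weights once."""
    grads = [local_gradient(weights, shard) for shard in shards]
    total = [sum(g[i] for g in grads) for i in range(len(weights))]
    n = sum(len(shard) for shard in shards)
    return [w - learning_rate * t / n for w, t in zip(weights, total)]


def train(shards, dim, iterations=50):
    """Run all iterations in one loop, as a long-running process would."""
    weights = [0.0] * dim
    for _ in range(iterations):
        weights = global_step(weights, shards)
    return weights


# Toy data: features are [bias, x], label is 1 when x > 0.
shards = [
    [([1.0, 2.0], 1), ([1.0, -1.0], 0)],
    [([1.0, 1.5], 1), ([1.0, -2.5], 0)],
]
print(train(shards, dim=2))
```

The only thing that crosses shard boundaries is the summed gradient and the
updated weights, which is why the per-iteration job setup is pure overhead.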

And I agree that the current solution is far from a generic framework
based on hadoop, but it is really worth taking a look at, and it might be
very valuable to migrate to hadoop or mahout.

Best wishes,
Stanley Xu



On Fri, Jun 10, 2011 at 12:49 PM, Jake Mannix <jake.mannix@gmail.com> wrote:

> It's all c++, custom distributed processing, custom distributed
> coordination and storage.
>
> We can certainly try to port over the algorithmic ideas, but the
> distributed systems stuff would be a significant departure from our
> current setup - it's not a web service and it's not hadoop, and it's not
> a command line utility - it's a cluster of long-running processes all
> intercommunicating.  Sounds awesome, but that's a ways off from where we
> are now.
>
>  -jake
>
> On Thu, Jun 9, 2011 at 7:52 PM, Stanley Xu <wenhao.xu@gmail.com> wrote:
>
> > Awesome! Guess it would be much faster than the current version in
> > Mahout. Is it possible to just use this version in mahout?
> >
> > On Fri, Jun 10, 2011 at 8:12 AM, <jeremy@lewi.us> wrote:
> >
> > > Yahoo released its hadoop code for LDA
> > >
> > > http://blog.smola.org/post/6359713161/speeding-up-latent-dirichlet-allocation
