mahout-user mailing list archives

From "Levy, Mark" <m...@last.fm>
Subject RE: Cluster text docs
Date Fri, 18 Dec 2009 15:03:23 GMT
Hi Drew,

Below is a mail I sent to this list a while back.  Is this consistent with your experience?

Cheers,

Mark


On Sep 23, 2009, at 6:05 AM, Levy, Mark wrote:

> I've started to experiment with LDA and am finding that it creates only
> a single long-running map task for each iteration, which doesn't scale
> well.  The map is taking 20mins for 10k of my input SparseVectors, and 5
> hours for 100k (the vocabulary size also grows when there are more
> vectors).
>
> Is this expected or am I doing something wrong?  Are there any existing
> performance benchmarks?
>


> -----Original Message-----
> From: Drew Farris [mailto:drew.farris@gmail.com]
> Sent: 18 December 2009 13:59
> To: mahout-user@lucene.apache.org
> Subject: Re: Cluster text docs
> 
> Hi Shashi,
> 
> On Fri, Dec 18, 2009 at 1:36 AM, Shashikant Kore <shashikant@gmail.com>
> wrote:
> 
> > (.. cluster assignment is already there. Wonder why you had to redo
> > it.)
> 
> Ahh, yes. I didn't have to re-do it, but I did want to learn the
> internal structure of the data files and to point out that it was easy
> enough to achieve. The code is quite straightforward.
> 
> > Drew, are you using the latest code? Overnight sounds too long.
> 
> That's good to know. This was a month or two ago, before the
> matrix/math stuff was rolled in. I'll collect exact times on the next
> run I do.
> 
> Has anyone else run LDA outside of the canned Reuters example? I would
> be interested to hear about corpus characteristics and processing
> power required to successfully produce LDA clusters. I've had all
> sorts of issues, though mostly Hadoop configuration nits specific
> to my environment.
> 
> Drew
