mahout-user mailing list archives

From Kris Jack <mrkrisj...@gmail.com>
Subject Re: Generating a Document Similarity Matrix
Date Fri, 18 Jun 2010 16:46:24 GMT
Thanks Ted,

I got that working.  Unfortunately, the matrix multiplication job is taking
far longer than I hoped.  With just over 10 million documents, 10 mappers
and 10 reducers, I can't get it to complete the job in under 48 hours.

Perhaps you have an idea for speeding it up?  I have already been quite
ruthless about making the vectors sparse: I dropped terms that appear in
over 1% of the corpus and kept only terms that appear at least 50 times.
Is it normal for the matrix multiplication MapReduce job to take this long
with this quantity of data and these resources, or do you think that my
system is not configured properly?
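
For reference, the pruning described above can be sketched roughly as follows.
This is a minimal in-memory illustration in Python, not the Mahout code that
was actually run. The function name `prune_vocabulary` and its parameters are
hypothetical, and it assumes "1% of the corpus" means 1% of documents and
"at least 50 times" means 50 total occurrences.

```python
from collections import Counter


def prune_vocabulary(docs, max_doc_fraction=0.01, min_count=50):
    """Return the set of terms to keep (hypothetical helper).

    docs: list of token lists, one list per document.
    A term is kept if it occurs at least `min_count` times in total
    (assumed meaning of "at least 50 times") and appears in no more
    than `max_doc_fraction` of the documents.
    """
    doc_freq = Counter()     # number of documents containing each term
    total_count = Counter()  # total occurrences of each term
    for tokens in docs:
        total_count.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document

    max_docs = max_doc_fraction * len(docs)
    return {
        term
        for term, count in total_count.items()
        if count >= min_count and doc_freq[term] <= max_docs
    }
```

On a real corpus of this size the two counts would come from a MapReduce pass
rather than from in-memory counters; the filter condition is the same.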

Thanks,
Kris



2010/6/15 Ted Dunning <ted.dunning@gmail.com>

> Thresholds are generally dangerous.  It is usually preferable to specify the
> sparseness you want (1%, 0.2%, whatever), sort the results in descending
> score order using Hadoop's built-in capabilities and just drop the rest.
>
> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <mrkrisjack@gmail.com> wrote:
>
> > I was wondering if there was an
> > interesting way to do this with the current mahout code such as
> > requesting that the Vector accumulator returns only elements that have
> > values greater than a given threshold, sorting the vector by value
> > rather than key, or something else?
> >
>
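
Ted's suggestion of fixing the sparseness instead of a score threshold can be
sketched for a single row as below. This is a small in-memory Python
illustration, not Mahout or Hadoop code: the name `keep_top_fraction` is
hypothetical, and it assumes "sparseness" means the fraction of a row's
columns that may stay non-zero. In a real job the ordering would come from
Hadoop's sort, as described in the quoted message, rather than from a heap.

```python
import heapq


def keep_top_fraction(row, num_columns, sparseness):
    """Keep only the highest-scoring entries of one sparse row.

    row: dict mapping column index -> score (non-zero entries only).
    num_columns: total number of columns in the matrix.
    sparseness: target fraction of columns allowed to stay non-zero
        (assumed interpretation of "1%, 0.2%, whatever").
    """
    # Number of entries to keep; always keep at least one.
    k = max(1, int(sparseness * num_columns))
    # Select the k largest scores, then drop everything else.
    top = heapq.nlargest(k, row.items(), key=lambda item: item[1])
    return dict(top)
```

Unlike a fixed threshold, this bounds the number of entries kept per row no
matter how the scores happen to be distributed.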
