mahout-user mailing list archives

From Kris Jack <mrkrisj...@gmail.com>
Subject Re: Generating a Document Similarity Matrix
Date Fri, 18 Jun 2010 16:54:56 GMT
Thanks Sebastian, I'll give it a try!



2010/6/18 Sebastian Schelter <ssc.open@googlemail.com>

> Hi Kris,
>
> Maybe you want to give the patch from
> https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not yet
> tested it with larger data, but I would be happy to get some
> feedback on it, and maybe it helps with your use case.
>
> -sebastian
>
> On 18.06.2010 18:46, Kris Jack wrote:
> > Thanks Ted,
> >
> > I got that working.  Unfortunately, the matrix multiplication job is taking
> > far longer than I hoped.  With just over 10 million documents, 10 mappers
> > and 10 reducers, I can't get it to complete the job in under 48 hours.
> >
> > Perhaps you have an idea for speeding it up?  I have already been quite
> > ruthless with making the vectors sparse.  I did not include terms that
> > appeared in over 1% of the corpus and only kept terms that appeared at least
> > 50 times.  Is it normal for the matrix multiplication MapReduce task
> > to take so long with this quantity of data and these resources,
> > or do you think that my system is not configured properly?
> >
> > Thanks,
> > Kris
> >
> >
> >
> > 2010/6/15 Ted Dunning <ted.dunning@gmail.com>
> >
> >
> >> Thresholds are generally dangerous.  It is usually preferable to specify the
> >> sparseness you want (1%, 0.2%, whatever), sort the results in descending
> >> score order using Hadoop's built-in capabilities and just drop the rest.
> >>
> >> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <mrkrisjack@gmail.com> wrote:
> >>
> >>
> >>> I was wondering if there was an
> >>> interesting way to do this with the current Mahout code such as requesting
> >>> that the Vector accumulator returns only elements that have values greater
> >>> than a given threshold, sorting the vector by value rather than key, or
> >>> something else?
> >>>
> >>>
> >>
> >
>
>
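
The sparsification idea quoted above (pick a target sparseness, sort by
descending score, drop the rest) can be sketched outside of Hadoop as
follows. This is only an illustration in plain Python, not Mahout code: the
function name keep_top_entries, the dict-based row representation, and the
choice to apply the target per row are all assumptions made for the example.

```python
import heapq


def keep_top_entries(row, num_columns, sparseness):
    """Keep only the highest-scoring entries of one sparse row.

    row         -- dict mapping column index -> score (hypothetical stand-in
                   for one row of the document similarity matrix)
    num_columns -- total number of columns in the full matrix
    sparseness  -- target fraction of non-zero entries, e.g. 0.01 for 1%
    """
    if not 0 < sparseness <= 1:
        raise ValueError("sparseness must be in (0, 1]")
    # Number of entries to keep; always keep at least one.
    k = max(1, round(num_columns * sparseness))
    # nlargest returns the k best (index, score) pairs in descending score
    # order, which is the "sort descending and drop the rest" step.
    top = heapq.nlargest(k, row.items(), key=lambda item: item[1])
    return dict(top)


row = {0: 0.9, 3: 0.1, 5: 0.7, 8: 0.4}
pruned = keep_top_entries(row, num_columns=10, sparseness=0.2)
```

Unlike a fixed score threshold, the number of entries kept here depends only
on the requested sparseness, so the output size is predictable whatever the
score distribution looks like.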


-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
