mahout-user mailing list archives

From Sebastian Schelter <ssc.o...@googlemail.com>
Subject Re: Generating a Document Similarity Matrix
Date Fri, 18 Jun 2010 16:51:17 GMT
Hi Kris,

maybe you want to give the patch from
https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not
tested it with larger data yet, but I would be happy to get some
feedback on it, and maybe it helps with your use case.

-sebastian

On 18.06.2010 18:46, Kris Jack wrote:
> Thanks Ted,
>
> I got that working.  Unfortunately, the matrix multiplication job is taking
> far longer than I hoped.  With just over 10 million documents, 10 mappers
> and 10 reducers, I can't get it to complete the job in under 48 hours.
>
> Perhaps you have an idea for speeding it up?  I have already been quite
> ruthless with making the vectors sparse.  I did not include terms that
> appeared in over 1% of the corpus and only kept terms that appeared at least
> 50 times.  Is it normal that the matrix multiplication map reduce task
> should take so long to process with this quantity of data and resources
> available or do you think that my system is not configured properly?
>
> Thanks,
> Kris
>
>
>
> 2010/6/15 Ted Dunning <ted.dunning@gmail.com>
>
>   
>> Thresholds are generally dangerous.  It is usually preferable to specify the
>> sparseness you want (1%, 0.2%, whatever), sort the results in descending
>> score order using Hadoop's built-in capabilities and just drop the rest.
>>
>> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <mrkrisjack@gmail.com> wrote:
>>
>>> I was wondering if there was an interesting way to do this with the
>>> current mahout code such as requesting that the Vector accumulator
>>> returns only elements that have values greater than a given threshold,
>>> sorting the vector by value rather than key, or something else?
>   
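
Ted's suggestion quoted above (pick the sparseness you want, sort by score in descending order, drop the rest) might look roughly like the sketch below for a single row of the similarity matrix. It is plain Python, not Mahout or Hadoop code. The function name keep_top_fraction, the dict-based row, and the num_columns parameter are assumptions made for illustration; on a cluster the sort would presumably be done by Hadoop rather than in memory.

```python
import heapq


def keep_top_fraction(row, num_columns, fraction):
    """Keep only the highest-scoring entries of one sparse row.

    row         -- dict mapping column index -> score (hypothetical stand-in
                   for one row of the document similarity matrix)
    num_columns -- total number of columns in the full matrix
    fraction    -- target sparseness, e.g. 0.01 for 1%
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not row:
        return {}
    # Number of entries to retain; always keep at least one.
    k = max(1, int(num_columns * fraction))
    # nlargest orders entries by descending score and drops the rest.
    top = heapq.nlargest(k, row.items(), key=lambda item: item[1])
    return dict(top)


# Toy row with 5 non-zero scores out of 20 columns, kept at 10% sparseness.
row = {0: 0.91, 3: 0.12, 7: 0.55, 9: 0.08, 12: 0.47}
print(keep_top_fraction(row, num_columns=20, fraction=0.1))
```

Unlike a fixed score threshold, this bounds the number of entries kept per row no matter how the scores are distributed, which keeps the output size predictable.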

