mahout-user mailing list archives

From Juan José Ramos <jjar...@gmail.com>
Subject Re: Load output of rowsimilarity to memory
Date Mon, 24 Feb 2014 21:04:20 GMT
Correct me if I'm wrong, but isn't ItemSimilarityJob meant for
item-based CF? In particular, the documentation says:
Preferences in the input file should look like
userID,itemID[,preferencevalue]

And in my case the input I have is just text documents and I want to
pre-compute similarities between them beforehand, even before any user has
expressed any preference value for any item.

To use ItemSimilarityJob for this purpose, what input would I need to
provide? Would it be the output of seq2sparse?

Thanks again.


On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter <ssc@apache.org> wrote:

> You're right, my bad. If you don't use RowSimilarityJob directly, but
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
> (which calls RowSimilarityJob under the covers), your output will be a
> textfile that is directly usable with FileItemSimilarity.
>
> --sebastian
>
>
> On 02/24/2014 09:30 PM, Juan José Ramos wrote:
>
>> Thanks for the prompt reply.
>>
>> RowSimilarityJob produces an output in the form of:
>> Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}
>>
>> whereas FileItemSimilarity expects comma- or tab-separated input.
>>
>> I assume that you meant that the output of RowSimilarityJob can be loaded
>> by the FileItemSimilarity after doing the appropriate parsing. Is that
>> correct, or is there actually a way to load the raw output of
>> RowSimilarityJob into FileItemSimilarity?
>>
>> Thanks.
>>
>>
>> On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter <ssc@apache.org>
>> wrote:
>>
>>> The output of RowSimilarityJob can be loaded by the FileItemSimilarity.
>>>
>>> --sebastian
>>>
>>>
>>> On 02/24/2014 08:31 PM, Juan José Ramos wrote:
>>>
>>>> Is there a way to reproduce this process:
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>>
>>>> inside Java code and not using the command line tool? I am not interested
>>>> in the clustering part but in 'Calculate several similar docs to each doc
>>>> in the data'. In particular, I am interested in loading the output of the
>>>> rowsimilarity tool into memory to be used as my custom ItemSimilarity
>>>> implementation for an ItemBasedRecommender.
>>>>
>>>> What I want exactly is a matrix in memory where, for every doc in my
>>>> catalogue, I have the similarity with the 100 (that is the threshold I am
>>>> using) most similar items and an undefined similarity for the rest.
>>>>
>>>> Is it possible to do this with the Java API? I know it can be done by
>>>> calling the commands from inside the Java code, and I guess also by using
>>>> the corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and
>>>> RowSimilarityJob. But I still cannot see an easy way of parsing the
>>>> output of RowSimilarityJob into the memory representation I intend to use.
>>>>
>>>> Thanks a lot.
>>>>
>>>>
>>>>
>>>
>>
>
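
The parsing step discussed in the thread can be sketched in plain Java. The example below assumes the RowSimilarityJob output has already been dumped to text lines of the form quoted above (Key: 0: Value: {61112:0.2113...,52144:0.2379...}) and turns each line into comma-separated "rowID,otherRowID,similarity" lines, which is the comma-separated shape FileItemSimilarity is described as expecting. The class name RowSimilarityToCsv and the method toCsvLines are hypothetical names chosen for illustration; they are not part of Mahout. The sketch also assumes that the row keys can be used directly as item IDs, which may not hold if the rows were renumbered before the similarity job ran.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Mahout class: converts dumped RowSimilarityJob
// text lines into "id1,id2,similarity" lines.
class RowSimilarityToCsv {

    // Parses one line such as
    //   Key: 0: Value: {61112:0.2113,52144:0.2379}
    // and returns one CSV line per (column, similarity) entry.
    // Lines or entries that do not match the assumed format are skipped.
    static List<String> toCsvLines(String dumpLine) {
        List<String> out = new ArrayList<>();
        int valueIdx = dumpLine.indexOf(": Value:");
        if (!dumpLine.startsWith("Key:") || valueIdx < 0) {
            return out; // not a record line
        }
        String rowId = dumpLine.substring("Key:".length(), valueIdx).trim();
        int open = dumpLine.indexOf('{', valueIdx);
        int close = dumpLine.lastIndexOf('}');
        if (open < 0 || close <= open) {
            return out; // no vector body found
        }
        String body = dumpLine.substring(open + 1, close).trim();
        if (body.isEmpty()) {
            return out;
        }
        for (String entry : body.split(",")) {
            String[] kv = entry.trim().split(":");
            if (kv.length != 2) {
                continue; // e.g. an elided "..." entry
            }
            String col = kv[0].trim();
            String sim = kv[1].trim();
            try {
                // Validate both fields, but emit the original text so that
                // no precision is lost by reformatting the number.
                Long.parseLong(col);
                Double.parseDouble(sim);
                out.add(rowId + "," + col + "," + sim);
            } catch (NumberFormatException e) {
                // Skip malformed entries instead of failing the whole line.
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565}";
        for (String csv : toCsvLines(line)) {
            System.out.println(csv);
        }
    }
}
```

Writing the returned lines to a file should give something a file-based ItemSimilarity can read; pairs that never appear in the file then simply have no stored similarity, which matches the "undefined for the rest" behaviour asked for in the original question.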
