mahout-user mailing list archives

From Juan José Ramos <jjar...@gmail.com>
Subject Re: Load output of rowsimilarity to memory
Date Tue, 25 Feb 2014 09:52:10 GMT
Thanks for the answer.

That was the approach I had in mind in the first place; the only difference
is that I will write the output to a file that can later be used to
create a FileItemSimilarity.
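
A minimal sketch of that conversion step, assuming the RowSimilarityJob
output has already been dumped to complete text lines in the
"Key: ... Value: {...}" form quoted below, and assuming the target file
uses one "itemID1,itemID2,similarity" row per pair. The class and method
names are hypothetical and not part of Mahout:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Mahout class: turns one dumped row of
// similarities into comma-separated rows.
class RowSimilarityToCsv {

    // Matches "Key: <rowId>: Value: {<id>:<sim>,<id>:<sim>,...}".
    private static final Pattern LINE =
        Pattern.compile("^Key:\\s*(\\d+):\\s*Value:\\s*\\{(.*)\\}\\s*$");

    static List<String> toCsvRows(String dumpedLine) {
        Matcher m = LINE.matcher(dumpedLine.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line: " + dumpedLine);
        }
        String rowId = m.group(1);
        String body = m.group(2).trim();
        List<String> rows = new ArrayList<>();
        if (body.isEmpty()) {
            return rows; // row with no similar items
        }
        for (String entry : body.split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Unexpected entry: " + entry);
            }
            // Parse both fields to validate them, but keep the similarity
            // text exactly as it was written.
            long otherId = Long.parseLong(parts[0].trim());
            String similarity = parts[1].trim();
            Double.parseDouble(similarity);
            rows.add(rowId + "," + otherId + "," + similarity);
        }
        return rows;
    }

    public static void main(String[] args) {
        String line =
            "Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565}";
        for (String row : toCsvRows(line)) {
            System.out.println(row);
        }
    }
}
```

Each returned row can then be appended to the text file. The sketch rejects
lines that still contain a literal "..." placeholder, so it needs the full
dumped line.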

I think that would be a very nice feature to have in the API.

Thanks again.


On Mon, Feb 24, 2014 at 9:27 PM, Sebastian Schelter <ssc@apache.org> wrote:

> I overlooked that you're interested in document similarities. Sorry again :)
>
> Another way would be to read the output of RowSimilarityJob with a
> o.a.m.common.iterator.sequencefile.SequenceFileDirIterable
>
> You create a list of instances of
> o.a.m.cf.taste.impl.similarity.GenericItemSimilarity.ItemItemSimilarity
>
> e.g. for the output
>
>
> Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}
>
> you would do
>
> list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016));
> list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565));
> ...
>
> After that you create a GenericItemSimilarity from the list of
> ItemItemSimilarities, which is the in-memory item similarity you asked for.
>
> Hope that helps,
> Sebastian
>
>
>
> On 02/24/2014 10:04 PM, Juan José Ramos wrote:
>
>> Correct me if I'm wrong, but isn't ItemSimilarityJob meant for
>> item-based CF? In particular, the documentation says:
>> Preferences in the input file should look like
>> userID,itemID[,preferencevalue]
>>
>> And in my case the input I have is just text documents and I want to
>> pre-compute similarities between them beforehand, even before any user has
>> expressed any preference value for any item.
>>
>> In order to use ItemSimilarityJob for this purpose, what should be the
>> input I need to provide? Would it be the output of seq2sparse?
>>
>> Thanks again.
>>
>>
>> On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter <ssc@apache.org>
>> wrote:
>>
>>> You're right, my bad. If you don't use RowSimilarityJob directly, but
>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
>>> (which calls RowSimilarityJob under the covers), your output will be a
>>> textfile that is directly usable with FileItemSimilarity.
>>>
>>> --sebastian
>>>
>>>
>>> On 02/24/2014 09:30 PM, Juan José Ramos wrote:
>>>
>>>> Thanks for the prompt reply.
>>>>
>>>> RowSimilarityJob produces output in the form of:
>>>> Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}
>>>>
>>>> whereas FileItemSimilarity expects comma- or tab-separated input.
>>>>
>>>> I assume you meant that the output of RowSimilarityJob can be loaded
>>>> by FileItemSimilarity after doing the appropriate parsing. Is that
>>>> correct, or is there actually a way to load the raw output of
>>>> RowSimilarityJob into FileItemSimilarity?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter <ssc@apache.org>
>>>> wrote:
>>>>
>>>>> The output of RowSimilarityJob can be loaded by the
>>>>> FileItemSimilarity.
>>>>>
>>>>> --sebastian
>>>>>
>>>>>
>>>>> On 02/24/2014 08:31 PM, Juan José Ramos wrote:
>>>>>
>>>>>> Is there a way to reproduce this process:
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>>>>
>>>>>> inside Java code and not using the command line tool? I am not
>>>>>> interested in the clustering part but in 'Calculate several similar
>>>>>> docs to each doc in the data'. In particular, I am interested in
>>>>>> loading the output of the rowsimilarity tool into memory to be used
>>>>>> as my custom ItemSimilarity implementation for an
>>>>>> ItemBasedRecommender.
>>>>>>
>>>>>> What I want exactly is a matrix in memory where, for every doc in my
>>>>>> catalogue, I have the similarity with the 100 most similar items
>>>>>> (100 is the threshold I am using) and an undefined similarity for
>>>>>> the rest.
>>>>>>
>>>>>> Is this possible with the Java API? I know it can be done by calling
>>>>>> the commands from inside the Java code, and I guess also by using the
>>>>>> corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix
>>>>>> and RowSimilarityJob. But I still cannot see an easy way of parsing
>>>>>> the output of RowSimilarityJob into the in-memory representation I
>>>>>> intend to use.
>>>>>>
>>>>>> Thanks a lot.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
