mahout-user mailing list archives

From Pat Ferrel <pat.fer...@gmail.com>
Subject Re: recommenditembased returns 0 records from last map-reduce job
Date Fri, 25 Jul 2014 18:39:10 GMT
Sorry I haven’t read this thread carefully but it looks like you may be using the wrong IDs.

For most Mahout jobs you have to prepare your data to have Mahout IDs. You do this by looking
at each datum, and each time you see a new unique application-specific user or item ID you give
it the next Mahout ID, starting from 0. So Mahout IDs can be thought of as row and column numbers
in a matrix. The row IDs run from 0 through (number of rows - 1), and likewise for columns.

This always requires that you translate into Mahout IDs before the job and, after the job has
run, translate back into your application IDs. You need a bi-directional dictionary of some
type. I use a HashBiMap from Guava.
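
The translation step described above can be sketched roughly as follows. This is only an illustration in plain Python, not the Guava HashBiMap mentioned above and not anything shipped with Mahout; the names IdDictionary, index_of, app_id_of and encode_preferences, and the sample IDs, are made up for the example.

```python
class IdDictionary:
    """Bidirectional map between application IDs and contiguous IDs starting at 0."""

    def __init__(self):
        self._to_index = {}   # application ID -> integer ID
        self._to_app_id = []  # integer ID -> application ID (list position is the ID)

    def index_of(self, app_id):
        """Return the integer ID for app_id, assigning the next free one if unseen."""
        index = self._to_index.get(app_id)
        if index is None:
            index = len(self._to_app_id)
            self._to_index[app_id] = index
            self._to_app_id.append(app_id)
        return index

    def app_id_of(self, index):
        """Translate an integer ID back into the original application ID."""
        return self._to_app_id[index]

    def __len__(self):
        return len(self._to_app_id)


def encode_preferences(rows):
    """Rewrite (user, item, pref) rows with integer IDs; also return both dictionaries."""
    users, items = IdDictionary(), IdDictionary()
    encoded = [(users.index_of(u), items.index_of(i), pref) for u, i, pref in rows]
    return encoded, users, items


# Hypothetical sample data, only to show the round trip.
rows = [("u-11", "sku-A", 1.6), ("u-11", "sku-B", 1.6), ("u-123", "sku-A", 2.0)]
encoded, users, items = encode_preferences(rows)
```

Users and items get separate dictionaries here because they index different dimensions (rows and columns). After the job has run, app_id_of turns the IDs in the output back into application IDs.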

Also I’d avoid the threshold for now. If you get that wrong it will mess things up badly
and is very hard to tune. It’s there for completeness but I never use it.


On Jul 25, 2014, at 12:55 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:

Hi, nothing helps...
I do use mahout 0.9 compiled for CDH 4.7
I do provide only positive values
I do use itemsimilarityJob and do get 2000 similarities for 1400 unique
items
Input data is:
16*10^6 preferences
4*10^6 users
0.6*10^ items
I do use Pearson correlation, and preference values are: 1.0 and 2.0


2014-07-22 9:32 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:

> Ok, I have recompiled mahout 0.9 for CDH 4.7. I'll try this evening.
> Right now I don't see how can it help me. As far as I know the stuff I try
> to use is pretty old and stable.
> looks like I do apply it in a wrong way.
> 
> There is an option for recommenditembased named "--threshold". I do
> provide data for recommenditembased with preference values in range
> [1.1..2.0].
> I set --threshold to 1.2
> Is --threshold absolute, so it can be in [1.1 .. 2+], or is it relative, so it
> can be in [0.0 .. 0.99999]?
> 
> 
> 2014-07-22 3:54 GMT+04:00 Ted Dunning <ted.dunning@gmail.com>:
> 
>> That version is no longer supported.  You should upgrade to 0.9
>> 
>> 
>> 
>> 
>> On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
>> 
>>> 0.7-cdh4.7.0
>>> Anyway, recommenditembased does produce these directories:
>>> 
>>> /recommenditembased/temp/maxValues.bin
>>> /recommenditembased/temp/norms.bin
>>> /recommenditembased/temp/numNonZeroEntries.bin
>>> /recommenditembased/temp/pairwiseSimilarity
>>> /recommenditembased/temp/partialMultiply
>>> /recommenditembased/temp/prePartialMultiply1
>>> /recommenditembased/temp/prePartialMultiply2
>>> /recommenditembased/temp/preparePreferenceMatrix
>>> /recommenditembased/temp/similarityMatrix
>>> /recommenditembased/temp/weights
>>> 
>>> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing I
>>> need. Right now I try to read it using
>>> 
>>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>>> com.twitter.elephantbird.pig.load.SequenceFileLoader(
>>>    '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>>>    '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>>> )  as (intId: int, vector:tuple(cardinality:int,
>>> entries:bag{t:tuple(some_id:long, some_value:double)}));
>>> 
>>> 
>>> Looks like the vector is empty... Or I am doing something wrong.
>>> 
>>> 
>>> 
>>> 2014-07-21 22:09 GMT+04:00 Ted Dunning <ted.dunning@gmail.com>:
>>> 
>>>> Which version of Mahout?
>>>> 
>>>> 
>>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
>>>> 
>>>>> Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while processing
>>>>> Job-Specific
>>>>> 
>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
>>>>> sudo -u oozie mahout recommenditembased \
>>>>>                    --input \
>>>>>                    hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
>>>>>                    --output \
>>>>>                    hdfs://nameservice1/recommenditembased/output \
>>>>>                    --similarityClassname \
>>>>>                    SIMILARITY_LOGLIKELIHOOD \
>>>>>                   --numRecommendations \
>>>>>                    500 \
>>>>>                    --booleanData \
>>>>>                    false \
>>>>>                    --maxPrefsPerUser \
>>>>>                    1000 \
>>>>>                    --maxSimilaritiesPerItem \
>>>>>                    1000 \
>>>>>                    --minPrefsPerUser \
>>>>>                    5 \
>>>>>                    --maxPrefsPerUserInItemSimilarity \
>>>>>                    30 \
>>>>>                    --threshold \
>>>>>                   1.1 \
>>>>>                    --tempDir \
>>>>>                    hdfs://nameservice1/recommenditembased/temp \
>>>>>                    --outputPathForSimilarityMatrix \
>>>>>                    hdfs://nameservice1/recommenditembased/sim_matrix
>>>>> 
>>>>> 
>>>>> I'm on Cloudera cdh 4.7, looks like this feature is not supported.
>>>>> 
>>>>> 
>>>>> 2014-07-21 11:18 GMT+04:00 Peng Zhang <pzhang.xjtu@gmail.com>:
>>>>> 
>>>>>> Serega,
>>>>>> 
>>>>>> See the last line on how to pass outputPathForSimilarityMatrix options to
>>>>>> the recommenditembased command:
>>>>>> 
>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>                   --input visited_items_with_inverted_items \
>>>>>>                   --output result \
>>>>>>                   --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>>>                   --usersFile inverted_items \
>>>>>>                   --numRecommendations 500 \
>>>>>>                   --booleanData false \
>>>>>>                   --maxPrefsPerUser 100 \
>>>>>>                   --maxSimilaritiesPerItem 500 \
>>>>>>                   --minPrefsPerUser 0 \
>>>>>>                   --maxPrefsPerUserInItemSimilarity 30 \
>>>>>>                   --threshold 0.91 \
>>>>>>                   --tempDir  temp \
>>>>>>                   --outputPathForSimilarityMatrix similarityMatri \
>>>>>> 
>>>>>> 
>>>>>> Peng Zhang
>>>>>> pzhang.xjtu@gmail.com
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Jul 21, 2014, at 3:09 PM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
>>>>>> 
>>>>>>> I've inspected the code; our approach wouldn't work with booleanData=false.
>>>>>>> We calculate item similarity in the wrong way...(((
>>>>>>> Thank you
>>>>>>> 1. We provide a "fake" user_id and pass --usersFile in order to get
>>>>>>> recommendations for the "fake" user_id, where the user_id is a negative item_id.
>>>>>>> It worked when we provided user_id->item_id pairs without a preference.
>>>>>>> 2. Our target is to get item similarities. We tried
>>>>>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but it
>>>>>>> returns bad results compared to RecommenderJob with our "fake" user_id
>>>>>>> (inverted item_id)
>>>>>>> 
>>>>>>> 1. I'll try the option you provided.
>>>>>>> 2. I will remove the input with fake user_id and the usersFile with these fake ids
>>>>>>> 
>>>>>>> 3.
>>>>>>> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>>>>>>> I don't understand how to pass the --outputPathForSimilarityMatrix option to
>>>>>>> RecommenderJob
>>>>>>> 
>>>>>>> 
>>>>>>> 2014-07-21 4:58 GMT+04:00 Peng Zhang <pzhang.xjtu@gmail.com>:
>>>>>>> 
>>>>>>>> Serega,
>>>>>>>> 
>>>>>>>> I have two comments:
>>>>>>>> 1. Don’t use negative user ids. Since Mahout uses the user id as well as the
>>>>>>>> item id as the row/column index, you’d better use 0, 1, 2, etc. as ids
>>>>>>>> 2. If you want to get the item similarity information, you can use
>>>>>>>> --outputPathForSimilarityMatrix in the command
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Peng Zhang
>>>>>>>> M: +86 186-1658-7856
>>>>>>>> pzhang.xjtu@gmail.com
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> All bad things happen here:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
>>>>>>>>> User: oozie
>>>>>>>>> Process User: oozie
>>>>>>>>> Group: oozie
>>>>>>>>> Mapper Class: PartialMultiplyMapper
>>>>>>>>> Reducer Class: AggregateAndRecommendReducer
>>>>>>>>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
>>>>>>>>> Job Output Directory: hdfs://nameservice1/itemrec/output/
>>>>>>>>> 
>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
>>>>>>>>> 
>>>>>>>>> Why does Mahout return 0 rows? It works when booleanData=true
>>>>>>>>> (preferences are ignored...?)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:
>>>>>>>>> 
>>>>>>>>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>>>>>>>>>> users_file:
>>>>>>>>>> --inverted_item_id
>>>>>>>>>> -1
>>>>>>>>>> -2
>>>>>>>>>> -3
>>>>>>>>>> -4
>>>>>>>>>> 
>>>>>>>>>> users_items_prefs
>>>>>>>>>> --inverted item_id
>>>>>>>>>> -1 1 1.0
>>>>>>>>>> -2 2 1.0
>>>>>>>>>> -3 3 1.0
>>>>>>>>>> -4 4 1.0
>>>>>>>>>> --user_id item_id pref_value
>>>>>>>>>> 11   1 1.6
>>>>>>>>>> 11   2 1.6
>>>>>>>>>> 123 3 2.0
>>>>>>>>>> 123 4 2.0
>>>>>>>>>> 333 1 2.0
>>>>>>>>>> 333 2 1.6
>>>>>>>>>> --e.t.c.
>>>>>>>>>> 
>>>>>>>>>> if I set --booleanData true
>>>>>>>>>> then mahout returns the result.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <andrew.musselman@gmail.com>:
>>>>>>>>>> 
>>>>>>>>>>> I'm confused about how you're constructing the user file, and why there
>>>>>>>>>>> are negated item ids here.
>>>>>>>>>>> 
>>>>>>>>>>> Can you post some more details please, including Mahout version and some
>>>>>>>>>>> sample data sets?
>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi, I'm trying to create item similarity.
>>>>>>>>>>>> I gather items which users visit during shopping and then create a file:
>>>>>>>>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9], depending on
>>>>>>>>>>>> user action type and data source)
>>>>>>>>>>>> UNION
>>>>>>>>>>>> -item_id, item_id, 1.0 (from items dictionary)
>>>>>>>>>>>> 
>>>>>>>>>>>> and I do provide a userFile, where user_id = -item_id
>>>>>>>>>>>> 
>>>>>>>>>>>> The idea is to get item similarity. If any user visits an item named "A", I want
>>>>>>>>>>>> to show him items "B", "c", "xxx" using the preferences of other users.
>>>>>>>>>>>> 
>>>>>>>>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
>>>>>>>>>>>> 
>>>>>>>>>>>> Here are my settings:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>>>>>>>                 --input visited_items_with_inverted_items \
>>>>>>>>>>>>                 --output result \
>>>>>>>>>>>>                 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>>>>>>>>>                 --usersFile inverted_items \
>>>>>>>>>>>>                 --numRecommendations 500 \
>>>>>>>>>>>>                 --booleanData false \
>>>>>>>>>>>>                 --maxPrefsPerUser 100 \
>>>>>>>>>>>>                 --maxSimilaritiesPerItem 500 \
>>>>>>>>>>>>                 --minPrefsPerUser 0 \
>>>>>>>>>>>>                 --maxPrefsPerUserInItemSimilarity 30 \
>>>>>>>>>>>>                 --threshold 0.91 \
>>>>>>>>>>>>                 --tempDir  temp \
>>>>>>>>>>>> 
>>>>>>>>>>>> Some counters... I don't understand what they mean....
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
>>>>>>>>>>>> 
>>>>>>>>>>>> --------
>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
>>>>>>>>>>>> --------
>>>>>>>>>>>> 
>>>>>>>>>>>> why 0???

