mahout-user mailing list archives

From Serega Sheypak <serega.shey...@gmail.com>
Subject Re: recommenditembased returns 0 records from last map-reduce job
Date Mon, 21 Jul 2014 08:09:12 GMT
Thank you for your input.


2014-07-21 12:00 GMT+04:00 Peng Zhang <pzhang.xjtu@gmail.com>:

> My personal comments:
> 1. Data cleansing. One nice characteristic of Mahout’s CF
> recommendation is the simplicity of its input data, often just three
> columns (user, item, preference). If any value is missing, just don’t put
> the record in the input file. So I don’t see any need for data
> cleaning, provided the application has recorded user-item-preference
> correctly and you have translated user-id and item-id properly.
> 2. Loglikelihood often performs better than
> PearsonCorrelation in Mahout’s Collaborative Filtering. The former is
> focused on discrete values and the latter on continuous values.
> See Ted’s popular post "Surprise and Coincidence" about the former.
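To give some intuition for point 2, here is a minimal Python sketch of the log-likelihood ratio (LLR) test that SIMILARITY_LOGLIKELIHOOD is built on. It is an illustration, not Mahout's code: the function names are made up for this note, and the final mapping of the LLR score into [0, 1) via 1 - 1/(1 + llr) is an assumption about how the score is normalized. The inputs are plain co-occurrence counts, which is why preference values play no role in this measure.

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 contingency table of two items A and B.

    k11: users who interacted with both A and B
    k12: users who interacted with A but not B
    k21: users who interacted with B but not A
    k22: users who interacted with neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


def llr_similarity(k11, k12, k21, k22):
    """Map the LLR score into [0, 1); this normalization is an assumption."""
    llr = log_likelihood_ratio(k11, k12, k21, k22)
    return 1.0 - 1.0 / (1.0 + llr)
```

A table in which A and B co-occur exactly as often as chance predicts scores close to zero, while a table in which they always appear together scores high, regardless of the preference values attached to the interactions.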
>
>
> Peng Zhang
> pzhang.xjtu@gmail.com
>
>
>
>
>
> On Jul 21, 2014, at 3:37 PM, Serega Sheypak <serega.sheypak@gmail.com>
> wrote:
>
> > Thanks! I'll report this evening.
> >
> > Are there any articles about data preparation for mahout item
> > recommendation? There are many books but most of them are copy-paste of
> > javadoc and guides from mahout site.
> > I'm -1 at math, my challenges are:
> >
> > 1. Approaches for data cleaning: do I have to apply dead-simple
> > statistical rules?
> > "The empirical rule also states that approximately 95 percent of the data
> > values will fall within two standard deviations from the mean."
> > So if my user visits follow a normal distribution, does it make
> > sense to apply it? The idea is to filter out all noise.
> >
> > 2. similarityClassname - I don't have any intuition here... I see that
> > people use SIMILARITY_LOGLIKELIHOOD and PEARSON
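For question 1, the two-standard-deviation rule quoted above could be sketched in Python as follows. This is only an illustration under assumptions: the function name and the idea of filtering on a per-user visit count are hypothetical, and the rule only means "about 95 percent" if the counts are roughly normal, which visit counts often are not.

```python
from statistics import mean, pstdev


def within_two_sigma(visits_per_user):
    """Keep users whose visit count lies within mean +/- 2 standard deviations.

    visits_per_user: dict mapping user_id -> number of visits (hypothetical input).
    """
    counts = list(visits_per_user.values())
    mu = mean(counts)
    sigma = pstdev(counts)  # population standard deviation
    low, high = mu - 2 * sigma, mu + 2 * sigma
    return {user: n for user, n in visits_per_user.items() if low <= n <= high}
```

With only a handful of users a single extreme value inflates the standard deviation so much that nothing is filtered, so a check like this is meaningful only on a reasonably large sample.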
> >
> >
> > 2014-07-21 11:18 GMT+04:00 Peng Zhang <pzhang.xjtu@gmail.com>:
> >
> >> Serega,
> >>
> >> See the last line on how to pass the outputPathForSimilarityMatrix option
> >> to the recommenditembased command:
> >>
> >> sudo -u oozie mahout recommenditembased \
> >>                   --input visited_items_with_inverted_items \
> >>                   --output result \
> >>                   --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >>                   --usersFile inverted_items \
> >>                   --numRecommendations 500 \
> >>                   --booleanData false \
> >>                   --maxPrefsPerUser 100 \
> >>                   --maxSimilaritiesPerItem 500 \
> >>                   --minPrefsPerUser 0 \
> >>                   --maxPrefsPerUserInItemSimilarity 30 \
> >>                   --threshold 0.91 \
> >>                   --tempDir temp \
> >>                   --outputPathForSimilarityMatrix similarityMatri
> >>
> >>
> >> Peng Zhang
> >> pzhang.xjtu@gmail.com
> >>
> >>
> >>
> >>
> >>
> >> On Jul 21, 2014, at 3:09 PM, Serega Sheypak <serega.sheypak@gmail.com>
> >> wrote:
> >>
> >>> I've inspected the code; our approach wouldn't work with
> >>> booleanData=false.
> >>> We calculate item similarity in the wrong way...(((
> >>> Thank you
> >>> 1. We provide a "fake" user_id and provide --usersFile in order to get
> >>> recommendations for the "fake" user_id, where user_id is a negative item_id.
> >>> It worked when we provided user_id->item_id pairs without preference.
> >>> 2. Our target is to get item similarities. We tried
> >>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but
> >>> it returns bad results compared to RecommenderJob with our "fake" user_id
> >>> (inverted item_id)
> >>>
> >>> 1. I'll try the option you provided.
> >>> 2. I will remove input with fake user_id and usersFile with these fake
> >>> ids
> >>>
> >>> 3.
> >>> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
> >>> I don't understand how to pass the --outputPathForSimilarityMatrix option
> >>> to RecommenderJob
> >>>
> >>>
> >>> 2014-07-21 4:58 GMT+04:00 Peng Zhang <pzhang.xjtu@gmail.com>:
> >>>
> >>>> Serega,
> >>>>
> >>>> I have two comments:
> >>>> 1. Don’t use negative user ids. Since Mahout uses user id as well as
> >>>> item id as the row/column index, you’d better use 0, 1, 2, etc. as ids
> >>>> 2. If you want to get the item similarity information, you can use
> >>>> --outputPathForSimilarityMatrix in the command
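Following comment 1, one way to replace arbitrary (including negative) ids with 0, 1, 2, ... before writing the input file is sketched below in Python. The function name is hypothetical and this is not part of Mahout; it only shows the remapping, and the returned dictionary has to be kept so that results can be translated back to the original ids.

```python
def build_index(raw_ids):
    """Assign consecutive non-negative integers to ids in first-seen order."""
    index = {}
    for raw in raw_ids:
        if raw not in index:
            index[raw] = len(index)
    return index


def invert(index):
    """Build the reverse lookup: dense index -> original id."""
    return {dense: raw for raw, dense in index.items()}
```

Users and items would each get their own index, since the two id spaces are independent.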
> >>>>
> >>>> Regards,
> >>>> Peng Zhang
> >>>> M: +86 186-1658-7856
> >>>> pzhang.xjtu@gmail.com
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <serega.sheypak@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> All bad things happen here:
> >>>>>
> >>>>>
> >>>>>
> >>>>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
> >>>>> User: oozie
> >>>>> Process User: oozie
> >>>>> Group: oozie
> >>>>> Mapper Class: PartialMultiplyMapper
> >>>>> Reducer Class: AggregateAndRecommendReducer
> >>>>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
> >>>>> Job Output Directory: hdfs://nameservice1/itemrec/output/
> >>>>>
> >>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
> >>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
> >>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
> >>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
> >>>>>
> >>>>> Why does Mahout return 0 rows? It works when booleanData=true
> >>>>> (preferences are ignored...?)
> >>>>>
> >>>>>
> >>>>>
> >>>>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:
> >>>>>
> >>>>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
> >>>>>> users_file:
> >>>>>> --inverted_item_id
> >>>>>> -1
> >>>>>> -2
> >>>>>> -3
> >>>>>> -4
> >>>>>>
> >>>>>> users_items_prefs
> >>>>>> --inverted item_id
> >>>>>> -1 1 1.0
> >>>>>> -2 2 1.0
> >>>>>> -3 3 1.0
> >>>>>> -4 4 1.0
> >>>>>> --user_id item_id pref_value
> >>>>>> 11   1 1.6
> >>>>>> 11   2 1.6
> >>>>>> 123 3 2.0
> >>>>>> 123 4 2.0
> >>>>>> 333 1 2.0
> >>>>>> 333 2 1.6
> >>>>>> --e.t.c.
> >>>>>>
> >>>>>> if I set --booleanData true
> >>>>>> then mahout returns the result.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <andrew.musselman@gmail.com>:
> >>>>>>
> >>>>>>> I'm confused about how you're constructing the user file, and why there
> >>>>>>> are negated item ids here.
> >>>>>>>
> >>>>>>> Can you post some more details please, including Mahout version and some
> >>>>>>> sample data sets?
> >>>>>>>
> >>>>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi, I'm trying to create item similarity.
> >>>>>>>> I gather items which users visit during shopping and then create a file:
> >>>>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9], depending
> >>>>>>>> on user action type and data source)
> >>>>>>>> UNION
> >>>>>>>> -item_id, item_id, 1.0 (from items dictionary)
> >>>>>>>>
> >>>>>>>> and I do provide a userFile, where user_id = -item_id
> >>>>>>>>
> >>>>>>>> The idea is to get item similarity. If any user visits an item named "A", I
> >>>>>>>> want to show him items "B", "c", "xxx" using preferences of other users.
> >>>>>>>>
> >>>>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
> >>>>>>>>
> >>>>>>>> Here are my settings:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> sudo -u oozie mahout recommenditembased \
> >>>>>>>>                 --input visited_items_with_inverted_items \
> >>>>>>>>                 --output result \
> >>>>>>>>                 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >>>>>>>>                 --usersFile inverted_items \
> >>>>>>>>                 --numRecommendations 500 \
> >>>>>>>>                 --booleanData false \
> >>>>>>>>                 --maxPrefsPerUser 100 \
> >>>>>>>>                 --maxSimilaritiesPerItem 500 \
> >>>>>>>>                 --minPrefsPerUser 0 \
> >>>>>>>>                 --maxPrefsPerUserInItemSimilarity 30 \
> >>>>>>>>                 --threshold 0.91 \
> >>>>>>>>                 --tempDir temp
> >>>>>>>>
> >>>>>>>> Some counters... I don't get what they mean....
> >>>>>>>>
> >>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
> >>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
> >>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
> >>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
> >>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
> >>>>>>>>
> >>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
> >>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
> >>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
> >>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
> >>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
> >>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
> >>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>>>
> >>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> >>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> >>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
> >>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
> >>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
> >>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
> >>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
> >>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
> >>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
> >>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>>>
> >>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
> >>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
> >>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
> >>>>>>>> --------
> >>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
> >>>>>>>> --------
> >>>>>>>>
> >>>>>>>> why 0???
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>
