mahout-user mailing list archives

From Serega Sheypak <serega.shey...@gmail.com>
Subject Re: recommenditembased returns 0 records from last map-reduce job
Date Fri, 25 Jul 2014 07:55:45 GMT
Hi, nothing helps...
I am using Mahout 0.9 compiled for CDH 4.7.
I provide only positive preference values.
I run ItemSimilarityJob and get 2000 similarities for 1400 unique
items.
Input data is:
16*10^6 preferences
4*10^6 users
0.6*10^ items
I use Pearson correlation, and the preference values are 1.0 and 2.0.
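
For illustration, here is a minimal sketch of plain Pearson correlation over the preferences that co-rating users gave to two items. It is not Mahout's implementation, and the sample vectors are made up. It only shows one property of the measure that may matter when preferences take just two values: whenever one side has zero variance (for example, every co-rating user gave 1.0), the correlation is undefined, so no similarity can be produced for that pair.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length preference lists.

    Returns None when the correlation is undefined (empty input,
    mismatched lengths, or zero variance on either side).
    """
    n = len(xs)
    if n == 0 or n != len(ys):
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant vector has no variance, so the ratio is 0/0.
        return None
    return cov / math.sqrt(var_x * var_y)


# Hypothetical preferences of three co-rating users for two items.
constant_item = [1.0, 1.0, 1.0]   # everyone gave 1.0
mixed_item_a = [1.0, 2.0, 1.0]
mixed_item_b = [1.0, 2.0, 2.0]

print(pearson(constant_item, mixed_item_a))  # undefined -> None
print(pearson(mixed_item_a, mixed_item_b))   # defined
```

With only the values 1.0 and 2.0, constant vectors over a small set of co-rating users are likely, which could be one reason for so few similarities; that is a guess, not something confirmed here.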


2014-07-22 9:32 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:

> Ok, I have recompiled Mahout 0.9 for CDH 4.7. I'll try it this evening.
> Right now I don't see how it can help me. As far as I know, the stuff I'm
> trying to use is pretty old and stable.
> It looks like I'm applying it in the wrong way.
>
> There is an option for recommenditembased named "--threshold". I
> provide data for recommenditembased with preference values in the range
> [1.1..2.0].
> I set --threshold to 1.2.
> Is --threshold absolute, taking values from [1.1 .. 2+], or is it relative,
> taking values from [0.0 .. 0.99999]?
>
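
On the --threshold question: a minimal sketch, under the unverified assumption that the threshold is compared against the item-item similarity score and not against preference values. The mapping from a log-likelihood ratio to a score in [0, 1) used below is also an assumption made for illustration, and the item pairs and ratios are made up. If both assumptions hold, any threshold of 1.0 or more would discard every pair.

```python
def llr_to_similarity(llr):
    """Assumed mapping of a non-negative log-likelihood ratio to [0, 1)."""
    return 1.0 - 1.0 / (1.0 + llr)


def apply_threshold(similarities, threshold):
    """Keep only item pairs whose similarity is at least the threshold."""
    return {pair: s for pair, s in similarities.items() if s >= threshold}


# Hypothetical item pairs with made-up log-likelihood ratios.
similarities = {
    ("A", "B"): llr_to_similarity(50.0),  # about 0.98
    ("A", "C"): llr_to_similarity(2.0),   # about 0.67
    ("B", "C"): llr_to_similarity(0.1),   # about 0.09
}

print(apply_threshold(similarities, 0.91))  # only ("A", "B") survives
print(apply_threshold(similarities, 1.1))   # nothing survives
```

Under these assumptions the preference range [1.1..2.0] would be irrelevant to the choice of threshold.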
>
> 2014-07-22 3:54 GMT+04:00 Ted Dunning <ted.dunning@gmail.com>:
>
> That version is no longer supported.  You should upgrade to 0.9
>>
>>
>>
>>
>> On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak <serega.sheypak@gmail.com>
>> wrote:
>>
>> > 0.7-cdh4.7.0
>> > Anyway, recommenditembased does produce these directories:
>> >
>> > /recommenditembased/temp/maxValues.bin
>> > /recommenditembased/temp/norms.bin
>> > /recommenditembased/temp/numNonZeroEntries.bin
>> > /recommenditembased/temp/pairwiseSimilarity
>> > /recommenditembased/temp/partialMultiply
>> > /recommenditembased/temp/prePartialMultiply1
>> > /recommenditembased/temp/prePartialMultiply2
>> > /recommenditembased/temp/preparePreferenceMatrix
>> > /recommenditembased/temp/similarityMatrix
>> > /recommenditembased/temp/weights
>> >
>> > I suppose that "/recommenditembased/temp/similarityMatrix" is the thing
>> > I need. Right now I'm trying to read it using
>> >
>> > matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>> >  com.twitter.elephantbird.pig.load.SequenceFileLoader(
>> >     '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>> >     '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>> > )  as (intId: int, vector:tuple(cardinality:int,
>> > entries:bag{t:tuple(some_id:long, some_value:double)}));
>> >
>> >
>> > It looks like the vector is empty... Or I am doing something wrong.
>> >
>> >
>> >
>> > 2014-07-21 22:09 GMT+04:00 Ted Dunning <ted.dunning@gmail.com>:
>> >
>> > > Which version of Mahout?
>> > >
>> > >
>> > > On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <serega.sheypak@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while
>> > > > processing Job-Specific
>> > > >
>> > > > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
>> > > > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
>> > > > sudo -u oozie mahout recommenditembased \
>> > > >     --input hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
>> > > >     --output hdfs://nameservice1/recommenditembased/output \
>> > > >     --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>> > > >     --numRecommendations 500 \
>> > > >     --booleanData false \
>> > > >     --maxPrefsPerUser 1000 \
>> > > >     --maxSimilaritiesPerItem 1000 \
>> > > >     --minPrefsPerUser 5 \
>> > > >     --maxPrefsPerUserInItemSimilarity 30 \
>> > > >     --threshold 1.1 \
>> > > >     --tempDir hdfs://nameservice1/recommenditembased/temp \
>> > > >     --outputPathForSimilarityMatrix hdfs://nameservice1/recommenditembased/sim_matrix
>> > > >
>> > > >
>> > > > I'm on Cloudera cdh 4.7, looks like this feature is not supported.
>> > > >
>> > > >
>> > > > 2014-07-21 11:18 GMT+04:00 Peng Zhang <pzhang.xjtu@gmail.com>:
>> > > >
>> > > > > Serega,
>> > > > >
>> > > > > See the last line on how to pass the outputPathForSimilarityMatrix
>> > > > > option to the recommenditembased command:
>> > > > >
>> > > > > sudo -u oozie mahout recommenditembased \
>> > > > >                    --input visited_items_with_inverted_items \
>> > > > >                    --output result \
>> > > > >                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>> > > > >                    --usersFile inverted_items \
>> > > > >                    --numRecommendations 500 \
>> > > > >                    --booleanData false \
>> > > > >                    --maxPrefsPerUser 100 \
>> > > > >                    --maxSimilaritiesPerItem 500 \
>> > > > >                    --minPrefsPerUser 0 \
>> > > > >                    --maxPrefsPerUserInItemSimilarity 30 \
>> > > > >                    --threshold 0.91 \
>> > > > >                    --tempDir  temp \
>> > > > >                    --outputPathForSimilarityMatrix similarityMatri \
>> > > > >
>> > > > >
>> > > > > Peng Zhang
>> > > > > pzhang.xjtu@gmail.com
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Jul 21, 2014, at 3:09 PM, Serega Sheypak <serega.sheypak@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > I've inspected the code; our approach wouldn't work with
>> > > > > > booleanData=false.
>> > > > > > We calculate item similarity in the wrong way...(((
>> > > > > > Thank you
>> > > > > > 1. We provide "fake" user_ids and pass --usersFile in order to get
>> > > > > > recommendations for each "fake" user_id, where the user_id is a
>> > > > > > negative item_id. It worked when we provided user_id->item_id pairs
>> > > > > > without preferences.
>> > > > > > 2. Our target is to get item similarities. We tried
>> > > > > > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob,
>> > > > > > but it returns bad results compared to RecommenderJob with our "fake"
>> > > > > > user_id (inverted item_id).
>> > > > > >
>> > > > > > 1. I'll try the option you provided.
>> > > > > > 2. I will remove the input with fake user_ids and the usersFile with
>> > > > > > these fake ids.
>> > > > > >
>> > > > > > 3.
>> > > > > > https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>> > > > > > I don't understand how to pass the --outputPathForSimilarityMatrix
>> > > > > > option to RecommenderJob.
>> > > > > >
>> > > > > >
>> > > > > > 2014-07-21 4:58 GMT+04:00 Peng Zhang <pzhang.xjtu@gmail.com>:
>> > > > > >
>> > > > > >> Serega,
>> > > > > >>
>> > > > > >> I have two comments:
>> > > > > >> 1. Don’t use negative user ids. Since Mahout uses the user id as well
>> > > > > >> as the item id as the row/column index, you’d better use 0, 1, 2, etc.
>> > > > > >> as ids.
>> > > > > >> 2. If you want to get the item similarity information, you can use
>> > > > > >> --outputPathForSimilarityMatrix in the command.
>> > > > > >>
>> > > > > >> Regards,
>> > > > > >> Peng Zhang
>> > > > > >> M: +86 186-1658-7856
>> > > > > >> pzhang.xjtu@gmail.com
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <serega.sheypak@gmail.com>
>> > > > > >> wrote:
>> > > > > >>
>> > > > > >>> All bad things happen here:
>> > > > > >>>
>> > > > > >>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
>> > > > > >>> User: oozie
>> > > > > >>> Process User: oozie
>> > > > > >>> Group: oozie
>> > > > > >>> Mapper Class: PartialMultiplyMapper
>> > > > > >>> Reducer Class: AggregateAndRecommendReducer
>> > > > > >>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
>> > > > > >>> Job Output Directory: hdfs://nameservice1/itemrec/output/
>> > > > > >>>
>> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
>> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
>> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
>> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
>> > > > > >>>
>> > > > > >>> Why does Mahout return 0 rows? It works when booleanData=true
>> > > > > >>> (preferences are ignored...?)
>> > > > > >>>
>> > > > > >>>
>> > > > > >>>
>> > > > > >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:
>> > > > > >>>
>> > > > > >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>> > > > > >>>> users_file:
>> > > > > >>>> --inverted_item_id
>> > > > > >>>> -1
>> > > > > >>>> -2
>> > > > > >>>> -3
>> > > > > >>>> -4
>> > > > > >>>>
>> > > > > >>>> users_items_prefs
>> > > > > >>>> --inverted item_id
>> > > > > >>>> -1 1 1.0
>> > > > > >>>> -2 2 1.0
>> > > > > >>>> -3 3 1.0
>> > > > > >>>> -4 4 1.0
>> > > > > >>>> --user_id item_id pref_value
>> > > > > >>>> 11   1 1.6
>> > > > > >>>> 11   2 1.6
>> > > > > >>>> 123 3 2.0
>> > > > > >>>> 123 4 2.0
>> > > > > >>>> 333 1 2.0
>> > > > > >>>> 333 2 1.6
>> > > > > >>>> --e.t.c.
>> > > > > >>>>
>> > > > > >>>> if I set --booleanData true
>> > > > > >>>> then mahout returns the result.
>> > > > > >>>>
>> > > > > >>>>
>> > > > > >>>>
>> > > > > >>>>
>> > > > > >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <andrew.musselman@gmail.com>:
>> > > > > >>>>
>> > > > > >>>>> I'm confused about how you're constructing the user file, and why
>> > > > > >>>>> there are negated item ids here.
>> > > > > >>>>>
>> > > > > >>>>> Can you post some more details please, including Mahout version and
>> > > > > >>>>> some sample data sets?
>> > > > > >>>>>
>> > > > > >>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <serega.sheypak@gmail.com>
>> > > > > >>>>> wrote:
>> > > > > >>>>>>
>> > > > > >>>>>> Hi, I'm trying to compute item similarities.
>> > > > > >>>>>> I gather the items that users visit during shopping and then create a
>> > > > > >>>>>> file:
>> > > > > >>>>>> user_id, item_id, weight (where weight can be [1.0, 1.6, 1.9],
>> > > > > >>>>>> depending on user action type and data source)
>> > > > > >>>>>> UNION
>> > > > > >>>>>> -item_id, item_id, 1.0 (from the items dictionary)
>> > > > > >>>>>>
>> > > > > >>>>>> and I provide a usersFile, where user_id = -item_id.
>> > > > > >>>>>>
>> > > > > >>>>>> The idea is to get item similarity. If any user visits an item named
>> > > > > >>>>>> "A", I want to show him items "B", "c", "xxx" using the preferences of
>> > > > > >>>>>> other users.
>> > > > > >>>>>>
>> > > > > >>>>>> The problem is that the last (???) MapReduce job returns 0 rows:
>> > > > > >>>>>>
>> > > > > >>>>>> Here are my settings:
>> > > > > >>>>>>
>> > > > > >>>>>>
>> > > > > >>>>>> sudo -u oozie mahout recommenditembased \
>> > > > > >>>>>>                  --input visited_items_with_inverted_items \
>> > > > > >>>>>>                  --output result \
>> > > > > >>>>>>                  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>> > > > > >>>>>>                  --usersFile inverted_items \
>> > > > > >>>>>>                  --numRecommendations 500 \
>> > > > > >>>>>>                  --booleanData false \
>> > > > > >>>>>>                  --maxPrefsPerUser 100 \
>> > > > > >>>>>>                  --maxSimilaritiesPerItem 500 \
>> > > > > >>>>>>                  --minPrefsPerUser 0 \
>> > > > > >>>>>>                  --maxPrefsPerUserInItemSimilarity 30 \
>> > > > > >>>>>>                  --threshold 0.91 \
>> > > > > >>>>>>                  --tempDir  temp \
>> > > > > >>>>>>
>> > > > > >>>>>> Some counters... I don't get what they mean....
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>> > > > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>> > > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
>> > > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>> > > > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>> > > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
>> > > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
>> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
>> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
>> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
>> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
>> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
>> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
>> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
>> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
>> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
>> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
>> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
>> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
>> > > > > >>>>>>
>> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
>> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
>> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
>> > > > > >>>>>> --------
>> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
>> > > > > >>>>>> --------
>> > > > > >>>>>>
>> > > > > >>>>>> Why 0???
>> > > > > >>>>>
>> > > > > >>>>
>> > > > > >>>>
>> > > > > >>
>> > > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
