mahout-user mailing list archives

From Serega Sheypak <serega.shey...@gmail.com>
Subject Re: recommenditembased returns 0 records from last map-reduce job
Date Mon, 21 Jul 2014 20:14:29 GMT
Hm. I read a sample from ratingMatrix,
and none of these ids

key[265946039] idx 1278942761 get 1.600000023841858
key[266002133] idx 242466370 get 1.600000023841858
key[266024933] idx 335624517 get 1.600000023841858
key[266024933] idx 291527196 get 1.600000023841858
key[266024933] idx 1406499341 get 1.600000023841858
key[266024933] idx 836009310 get 1.600000023841858
key[266024933] idx 331659103 get 1.600000023841858
key[266106533] idx 689552069 get 1.600000023841858

are among my user_id or item_id values.
1.600000023841858 looks like the preference value from a user_id,
item_id, pref relation.

code:
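import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.SequenceFile
import org.apache.mahout.math.VectorWritable

// pathToFile: a Path to one part file of ratingMatrix (defined elsewhere)
def reader = new SequenceFile.Reader(new Configuration(),
        SequenceFile.Reader.file(pathToFile))
IntWritable key = new IntWritable()
VectorWritable value = new VectorWritable()
try {
    while (reader.next(key, value)) {
        // print every non-zero entry of the row vector
        def itr = value.get().iterateNonZero()
        while (itr.hasNext()) {
            def elem = itr.next()
            println "key[$key] idx ${elem.index()} get ${elem.get()}"
        }
    }
} finally {
    reader.close()
}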
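
A possible reason why the printed ids do not match the input, offered as a sketch and not as a confirmed diagnosis: as far as I can tell, the Hadoop jobs do not store the original long ids in these vectors but map each id to a non-negative int index first (TasteHadoopUtils.idToIndex, which I believe computes 0x7FFFFFFF & Longs.hashCode(id)). That formula is an assumption recalled from the 0.7-era source and should be checked against the exact Mahout version in use. The Python below only mimics that arithmetic; the function names are made up for illustration.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def longs_hash_code_low_bits(value: int) -> int:
    """Low 32 bits of Java's (int) (value ^ (value >>> 32)) for a 64-bit long."""
    unsigned = value & MASK_64            # view the signed long as unsigned 64-bit
    return (unsigned ^ (unsigned >> 32)) & 0xFFFFFFFF


def id_to_index(long_id: int) -> int:
    """ASSUMED equivalent of Mahout's idToIndex: clear the sign bit of the hash."""
    return longs_hash_code_low_bits(long_id) & 0x7FFFFFFF


if __name__ == "__main__":
    # An id that fits in a positive int, two "fake" negative ids,
    # and one id that needs more than 32 bits.
    for sample in (266024933, -1, -2, 2**32 + 5):
        print(sample, "->", id_to_index(sample))
```

Under this assumption, ids that fit in a positive int keep their value, ids that need more than 31 bits are folded into a different number, and negative ids such as -1 and -2 land on 0 and 1, where they can collide with real ids. If the user ids in the input are large longs, that could explain why the idx values are not found among user_id.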




2014-07-21 23:57 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:

> temp/preparePreferenceMatrix/ratingMatrix
> has data, and it looks like similarity between items...
> I'm confused. How can I get the item similarity?
>
>
> 2014-07-21 23:48 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:
>
> The code snippet:
>>
>>  @Test//(enabled = false)
>>     void testReadAll(){
>>         (0..5).each {
>>
>>             def pathToFile = new Path("matrixSim/part-r-0000$it")
>>             println pathToFile
>>             def reader = new SequenceFile.Reader(new Configuration(),
>> SequenceFile.Reader.file(pathToFile));
>>             IntWritable key = new IntWritable();
>>             VectorWritable value = new VectorWritable();
>>             while(reader.next(key, value)){
>>                 def itr = value.get().iterateNonZero()
>>                 while(itr.hasNext()){
>>                     println itr.next()
>>                 }
>>             }
>>             reader.close();
>>         }
>>     }
>>
>>
>>  2014-07-21 23:46 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:
>>
>> I've parsed it via Java, and the matrix is empty. Why?
>>>
>>>
>>> 2014-07-21 22:41 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:
>>>
>>> 0.7-cdh4.7.0
>>>> Anyway, recommenditembased does produce these directories:
>>>>
>>>> /recommenditembased/temp/maxValues.bin
>>>> /recommenditembased/temp/norms.bin
>>>> /recommenditembased/temp/numNonZeroEntries.bin
>>>> /recommenditembased/temp/pairwiseSimilarity
>>>> /recommenditembased/temp/partialMultiply
>>>> /recommenditembased/temp/prePartialMultiply1
>>>> /recommenditembased/temp/prePartialMultiply2
>>>> /recommenditembased/temp/preparePreferenceMatrix
>>>> /recommenditembased/temp/similarityMatrix
>>>> /recommenditembased/temp/weights
>>>>
>>>> I suppose that "/recommenditembased/temp/similarityMatrix" is the
>>>> thing I need. Right now I'm trying to read it using
>>>>
>>>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>>>>  com.twitter.elephantbird.pig.load.SequenceFileLoader(
>>>>     '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>>>>     '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>>>> )  as (intId: int, vector:tuple(cardinality:int,
>>>> entries:bag{t:tuple(some_id:long, some_value:double)}));
>>>>
>>>>
>>>> Looks like the vector is empty... Or I am doing something wrong.
>>>>
>>>>
>>>>
>>>> 2014-07-21 22:09 GMT+04:00 Ted Dunning <ted.dunning@gmail.com>:
>>>>
>>>> Which version of Mahout?
>>>>>
>>>>>
>>>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
>>>>>
>>>>> > Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while processing Job-Specific
>>>>> >
>>>>> > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
>>>>> > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
>>>>> > sudo -u oozie mahout recommenditembased \
>>>>> >     --input hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
>>>>> >     --output hdfs://nameservice1/recommenditembased/output \
>>>>> >     --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>> >     --numRecommendations 500 \
>>>>> >     --booleanData false \
>>>>> >     --maxPrefsPerUser 1000 \
>>>>> >     --maxSimilaritiesPerItem 1000 \
>>>>> >     --minPrefsPerUser 5 \
>>>>> >     --maxPrefsPerUserInItemSimilarity 30 \
>>>>> >     --threshold 1.1 \
>>>>> >     --tempDir hdfs://nameservice1/recommenditembased/temp \
>>>>> >     --outputPathForSimilarityMatrix hdfs://nameservice1/recommenditembased/sim_matrix
>>>>> >
>>>>> >
>>>>> > I'm on Cloudera CDH 4.7; it looks like this option is not supported there.
>>>>> >
>>>>> >
>>>>> > 2014-07-21 11:18 GMT+04:00 Peng Zhang <pzhang.xjtu@gmail.com>:
>>>>> >
>>>>> > > Serega,
>>>>> > >
>>>>> > > See the last line for how to pass the outputPathForSimilarityMatrix
>>>>> > > option to the recommenditembased command:
>>>>> > >
>>>>> > > sudo -u oozie mahout recommenditembased \
>>>>> > >     --input visited_items_with_inverted_items \
>>>>> > >     --output result \
>>>>> > >     --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>> > >     --usersFile inverted_items \
>>>>> > >     --numRecommendations 500 \
>>>>> > >     --booleanData false \
>>>>> > >     --maxPrefsPerUser 100 \
>>>>> > >     --maxSimilaritiesPerItem 500 \
>>>>> > >     --minPrefsPerUser 0 \
>>>>> > >     --maxPrefsPerUserInItemSimilarity 30 \
>>>>> > >     --threshold 0.91 \
>>>>> > >     --tempDir temp \
>>>>> > >     --outputPathForSimilarityMatrix similarityMatri
>>>>> > >
>>>>> > >
>>>>> > > Peng Zhang
>>>>> > > pzhang.xjtu@gmail.com
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > On Jul 21, 2014, at 3:09 PM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
>>>>> > >
>>>>> > > > I've inspected the code; our approach wouldn't work with
>>>>> > > > booleanData=false. We calculate item similarity in the wrong way...(((
>>>>> > > > Thank you
>>>>> > > > 1. We provide a "fake" user_id and pass --usersFile in order to get
>>>>> > > > recommendations for the "fake" user_id, where user_id is a negative
>>>>> > > > item_id. It worked when we provided user_id->item_id pairs without
>>>>> > > > preference.
>>>>> > > > 2. Our target is to get item similarities. We tried
>>>>> > > > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob, but
>>>>> > > > it returns bad results compared to RecommenderJob with our "fake"
>>>>> > > > user_id (inverted item_id)
>>>>> > > >
>>>>> > > > 1. I'll try the option you provided.
>>>>> > > > 2. I will remove the input with fake user_id and the usersFile with
>>>>> > > > these fake ids
>>>>> > > >
>>>>> > > > 3.
>>>>> > > > https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>>>>> > > > I don't understand how to pass the --outputPathForSimilarityMatrix
>>>>> > > > option to RecommenderJob
>>>>> > > >
>>>>> > > >
>>>>> > > > 2014-07-21 4:58 GMT+04:00 Peng Zhang <pzhang.xjtu@gmail.com>:
>>>>> > > >
>>>>> > > >> Serega,
>>>>> > > >>
>>>>> > > >> I have two comments:
>>>>> > > >> 1. Don’t use negative user ids. Since Mahout uses user id as well as
>>>>> > > >> item id as the row/column index, you’d better use 0, 1, 2, etc. as ids
>>>>> > > >> 2. If you want to get the item similarity information, you can use
>>>>> > > >> --outputPathForSimilarityMatrix in the command
>>>>> > > >>
>>>>> > > >> Regards,
>>>>> > > >> Peng Zhang
>>>>> > > >> M: +86 186-1658-7856
>>>>> > > >> pzhang.xjtu@gmail.com
>>>>> > > >>
>>>>> > > >>
>>>>> > > >>
>>>>> > > >>
>>>>> > > >>
>>>>> > > >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
>>>>> > > >>
>>>>> > > >>> All bad things happen here:
>>>>> > > >>>
>>>>> > > >>>
>>>>> > > >>>
>>>>> > > >>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
>>>>> > > >>> User: oozie
>>>>> > > >>> Process User: oozie
>>>>> > > >>> Group: oozie
>>>>> > > >>> Mapper Class: PartialMultiplyMapper
>>>>> > > >>> Reducer Class: AggregateAndRecommendReducer
>>>>> > > >>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
>>>>> > > >>> Job Output Directory: hdfs://nameservice1/itemrec/output/
>>>>> > > >>>
>>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
>>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
>>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
>>>>> > > >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
>>>>> > > >>>
>>>>> > > >>> Why does Mahout return 0 rows? It works when booleanData=true
>>>>> > > >>> (preferences are ignored...?)
>>>>> > > >>>
>>>>> > > >>>
>>>>> > > >>>
>>>>> > > >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <serega.sheypak@gmail.com>:
>>>>> > > >>>
>>>>> > > >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>>>>> > > >>>> users_file:
>>>>> > > >>>> --inverted_item_id
>>>>> > > >>>> -1
>>>>> > > >>>> -2
>>>>> > > >>>> -3
>>>>> > > >>>> -4
>>>>> > > >>>>
>>>>> > > >>>> users_items_prefs
>>>>> > > >>>> --inverted item_id
>>>>> > > >>>> -1 1 1.0
>>>>> > > >>>> -2 2 1.0
>>>>> > > >>>> -3 3 1.0
>>>>> > > >>>> -4 4 1.0
>>>>> > > >>>> --user_id item_id pref_value
>>>>> > > >>>> 11   1 1.6
>>>>> > > >>>> 11   2 1.6
>>>>> > > >>>> 123 3 2.0
>>>>> > > >>>> 123 4 2.0
>>>>> > > >>>> 333 1 2.0
>>>>> > > >>>> 333 2 1.6
>>>>> > > >>>> --e.t.c.
>>>>> > > >>>>
>>>>> > > >>>> if I set --booleanData true
>>>>> > > >>>> then mahout returns the result.
>>>>> > > >>>>
>>>>> > > >>>>
>>>>> > > >>>>
>>>>> > > >>>>
>>>>> > > >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <andrew.musselman@gmail.com>:
>>>>> > > >>>>
>>>>> > > >>>>> I'm confused about how you're constructing the user file, and why
>>>>> > > >>>>> there are negated item ids here.
>>>>> > > >>>>>
>>>>> > > >>>>> Can you post some more details please, including Mahout version and
>>>>> > > >>>>> some sample data sets?
>>>>> > > >>>>>
>>>>> > > >>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
>>>>> > > >>>>>>
>>>>> > > >>>>>> Hi, I'm trying to compute item similarity.
>>>>> > > >>>>>> I gather the items which users visit during shopping and then create a file:
>>>>> > > >>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9], depending
>>>>> > > >>>>>> on user action type and data source)
>>>>> > > >>>>>> UNION
>>>>> > > >>>>>> -item_id, item_id, 1.0 (from the items dictionary)
>>>>> > > >>>>>>
>>>>> > > >>>>>> and I provide a userFile, where user_id = -item_id
>>>>> > > >>>>>>
>>>>> > > >>>>>> The idea is to get item similarity. If any user visits an item named "A", I
>>>>> > > >>>>>> want to show him items "B", "c", "xxx" using the preferences of other users.
>>>>> > > >>>>>>
>>>>> > > >>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
>>>>> > > >>>>>>
>>>>> > > >>>>>> Here are my settings:
>>>>> > > >>>>>>
>>>>> > > >>>>>>
>>>>> > > >>>>>> sudo -u oozie mahout recommenditembased \
>>>>> > > >>>>>>     --input visited_items_with_inverted_items \
>>>>> > > >>>>>>     --output result \
>>>>> > > >>>>>>     --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>> > > >>>>>>     --usersFile inverted_items \
>>>>> > > >>>>>>     --numRecommendations 500 \
>>>>> > > >>>>>>     --booleanData false \
>>>>> > > >>>>>>     --maxPrefsPerUser 100 \
>>>>> > > >>>>>>     --maxSimilaritiesPerItem 500 \
>>>>> > > >>>>>>     --minPrefsPerUser 0 \
>>>>> > > >>>>>>     --maxPrefsPerUserInItemSimilarity 30 \
>>>>> > > >>>>>>     --threshold 0.91 \
>>>>> > > >>>>>>     --tempDir temp
>>>>> > > >>>>>>
>>>>> > > >>>>>> Some counters... I don't get what they mean....
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>>>>> > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>>>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
>>>>> > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>> > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
>>>>> > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
>>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
>>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
>>>>> > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
>>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
>>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
>>>>> > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
>>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
>>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
>>>>> > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
>>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
>>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
>>>>> > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
>>>>> > > >>>>>>
>>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
>>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
>>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
>>>>> > > >>>>>>
>>>>> > > >>>>>> --------
>>>>> > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
>>>>> > > >>>>>> --------
>>>>> > > >>>>>>
>>>>> > > >>>>>> why 0???
