mahout-user mailing list archives

From Sameer Tilak <ssti...@live.com>
Subject RE: Elephant-Bird, Pig, and Mahout
Date Wed, 11 Dec 2013 22:55:52 GMT
Hi Andrew et al.,
I have the following statements in my Pig script:
AU = FOREACH A GENERATE myparser.myUDF(param1, param2);
STORE AU into '/scratch/AU';
AU has the following format:
(userid, (item_view_history))
(27,(0,1,1,0,0))
(28,(0,0,1,0,0))
(29,(0,0,1,0,1))
(30,(1,0,1,0,1))
I will have at least a few hundred thousand numbers in item_view_history; for readability
I am showing just 5 here.

VectorizedInput = FOREACH AU GENERATE FLATTEN($0);
/* I am assuming the field userid will be used as the key and written using
$INT_CONVERTER, and the tuple will be written using $VECTOR_CONVERTER. Is this correct? */
STORE VectorizedInput into '/scratch/VectorizedInput' using $SEQFILE_STORAGE ('-c $INT_CONVERTER',
'-c $VECTOR_CONVERTER');
I can see that /scratch/VectorizedInput has part- files. These files are binary, so it is hard
to tell whether the script is correct. Can anyone please comment on whether my understanding of
SEQFILE_STORAGE and VECTOR_CONVERTER is correct?
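
One possible way to inspect the binary part- files is to load them back with Elephant-Bird's SequenceFileLoader and dump a few records. The following is only a sketch under assumptions: it presumes the same jars are registered as in the quoted message below, that the loader accepts the same '-c' converter arguments as the storer, and the aliases Check and FirstFew are made up for illustration. The exact shape of the vector field reported by the loader may depend on the converter's options.

```pig
-- Hypothetical read-back check; assumes the Elephant-Bird and Mahout jars
-- are already registered, as in the quoted message below.
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare INT_CONVERTER 'com.twitter.elephantbird.pig.util.IntWritableConverter';
%declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';

-- Load with the same converters, in the same key/value order, used to store.
Check = LOAD '/scratch/VectorizedInput' USING $SEQFILE_LOADER ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');

-- Show whatever schema the loader reports, then print a handful of records.
DESCRIBE Check;
FirstFew = LIMIT Check 5;
DUMP FirstFew;
```

If the dumped records pair each userid with its item_view_history values, the key and value were most likely written as intended.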


> Date: Thu, 5 Dec 2013 10:21:16 -0800
> Subject: Re: Elephant-Bird, Pig, and Mahout
> From: andrew.musselman@gmail.com
> To: user@mahout.apache.org
> 
> There's an example on the Readme at
> https://github.com/kevinweil/elephant-bird/blob/master/Readme.md
> 
> Do you have a key to use for each vector?
> 
> I've done stuff like this, and I don't know off-hand if you need to have
> your records in a tuple to use VectorWritableConverter.
> 
> register path/to/lib/mahout/mahout-*.jar
> register path/to/elephant-bird-hadoop*.jar
> register path/to/elephant-bird-mahout*.jar
> register path/to/elephant-bird-pig*.jar
> %declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
> %declare INT_CONVERTER 'com.twitter.elephantbird.pig.util.IntWritableConverter';
> %declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
> %declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';
> a = load 'input' as (
>   pid: long,
>   v: (
>     f1: int,
>     f2: int,
>     f3: int));
> 
> store a into 'output' using $SEQFILE_STORAGE ('-c $LONG_CONVERTER',
>   '-c $VECTOR_CONVERTER');
> 
> 
> On Thu, Dec 5, 2013 at 9:35 AM, Sameer Tilak <sstilak@live.com> wrote:
> 
> > Hi All,
> > I have a question about using EB's VectorWritableConverter in my Pig
> > script for data vectorization.
> > I am generating the tuples using a UDF, however for
> > simplicity I am loading the data from a file in the following code. My
> > UDF returns tuples of the form (1,0,1,1...) etc.
> >
> > My map.dat file has the following format:
> >
> > 1,0,1,1
> > 0,1,1,1,
> > 0,0,1,1,
> > 1,1,0,0,
> > .......
> > .......
> > ........
> >
> > I register the necessary jar files.
> >
> > %declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
> > %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
> > %declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
> > %declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';
> >
> > /* Loading from a file instead of UDF for simplicity */
> >
> > A = LOAD 'map.dat';
> >
> > /*
> >  I am not sure how to use VectorWritableConverter to convert the tuples
> > in relation A to vectors. */
> > B = FOREACH A GENERATE $VECTOR_CONVERTER();
> >
> > DUMP B;
> >
 		 	   		  