mahout-user mailing list archives

From Andrew Musselman <andrew.mussel...@gmail.com>
Subject Re: Elephant-Bird, Pig, and Mahout
Date Thu, 05 Dec 2013 18:21:16 GMT
There's an example on the Readme at
https://github.com/kevinweil/elephant-bird/blob/master/Readme.md

Do you have a key to use for each vector?

I've done something like this, but I don't know offhand whether your
records need to be in a tuple to use VectorWritableConverter.

register path/to/lib/mahout/mahout-*.jar
register path/to/elephant-bird-hadoop*.jar
register path/to/elephant-bird-mahout*.jar
register path/to/elephant-bird-pig*.jar
%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
%declare INT_CONVERTER 'com.twitter.elephantbird.pig.util.IntWritableConverter';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
%declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';
a = load 'input' as (
  pid: long,
  v: (
    f1: int,
    f2: int,
    f3: int));

store a into 'output' using $SEQFILE_STORAGE ('-c $LONG_CONVERTER', '-c $VECTOR_CONVERTER');
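
If the rows have no natural key, one possible approach is to generate one
before storing. The following is only a sketch under several assumptions:
Pig 0.11 or later (for the RANK operator), comma-delimited input with
exactly four numeric columns, and the same jars registered as above. The
relation names (rows, ranked, keyed) and the paths 'map.dat' and 'output'
are placeholders.

```pig
-- Assumes the mahout and elephant-bird jars are already registered as above.
%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
%declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';

-- Assumption: four comma-separated columns per row, read as doubles
-- because Mahout vectors hold double values.
rows = load 'map.dat' using PigStorage(',') as (f1: double, f2: double, f3: double, f4: double);

-- RANK with no field list prepends a sequential long to every row;
-- it serves here as a made-up key because the data has none of its own.
ranked = rank rows;

-- Shape each record as (key, tuple), the same layout as the schema above.
keyed = foreach ranked generate $0 as pid, TOTUPLE(f1, f2, f3, f4) as v;

store keyed into 'output' using $SEQFILE_STORAGE ('-c $LONG_CONVERTER', '-c $VECTOR_CONVERTER');
```

This is untested against your data, so treat it as a starting point rather
than a known-good script.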


On Thu, Dec 5, 2013 at 9:35 AM, Sameer Tilak <sstilak@live.com> wrote:

> Hi All,
> I have a question about using EB's VectorWritableConverter in my Pig
> script for data vectorization.
> I am generating the tuples using a UDF; however, for simplicity I am
> loading the data from a file in the following code. My UDF returns
> tuples of the form (1,0,1,1,...).
>
> My map.dat file has the following format:
>
> 1,0,1,1
> 0,1,1,1,
> 0,0,1,1,
> 1,1,0,0,
> .......
> .......
> ........
>
> I register the necessary jar files.
>
> %declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
> %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
> %declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
> %declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';
>
> /* Loading from a file instead of UDF for simplicity */
>
> A = LOAD 'map.dat';
>
> /*
>  I am not sure how to use VectorWritableConverter to convert each tuple
>  in relation A to a vector. */
> B = FOREACH A GENERATE $VECTOR_CONVERTER();
>
> DUMP B;
>
