mahout-user mailing list archives

From Colum Foley <columfo...@gmail.com>
Subject Re: Elephant-Bird SequenceFile Storage of RandomAccessSparseVectors for Mahout
Date Mon, 04 Mar 2013 17:42:13 GMT
Hi Jake, Andy,

Indeed, that was the problem: I had thought the cardinality value was
the number of items in the bag. Many thanks for the help!

Is it OK to overestimate this value or does it need to match the
actual cardinality exactly?

Thanks,
Colum





On Mon, Mar 4, 2013 at 4:31 PM, Andy Schlaikjer
<andrew.schlaikjer@gmail.com> wrote:
> Thanks Jake. Yes, that's the first thing to fix. Generally, all of your
> sparse vectors should have the same "size" (cardinality), but may have
> different numbers of non-default values.
>
> Try updating your example input data to read:
>
> bbb     (10000,{(6595,4.0),(608,1.0)})
> ccd     (10000,{(9763,1.0)})
> adc     (10000,{(3670,1.0)})
> ads     (10000,{(2297,1.0)})
>
> All of your indices must fall within [0, cardinality).
>
> Andy
>
>
>
> On Mon, Mar 4, 2013 at 8:17 AM, Jake Mannix <jake.mannix@gmail.com> wrote:
>
>> I think the issue is with your understanding of what 'cardinality' means
>> here: it is the *dimension* of the vector (featureSpaceSize), not the
>> number of nonzero elements in that particular vector.
>>
>> On Monday, March 4, 2013, Colum Foley wrote:
>>
>> > Hi Andy,
>> >
>> > I am using Pig 0.10.0 (but am happy to try another). Yes, I am
>> > running in local mode with the example data below.
>> >
>> > Thanks again,
>> > Colum
>> >
>> > On Mon, Mar 4, 2013 at 3:02 PM, Andy Schlaikjer
>> > <andrew.schlaikjer@gmail.com> wrote:
>> > > Colum, thank you for passing on details. Could you also share with us
>> > > the version of Pig you are running? I assume you're running in local
>> > > mode with the example data below?
>> > >
>> > >
>> > > On Mar 4, 2013, at 3:19 AM, Colum Foley <columfoley@gmail.com> wrote:
>> > >
>> > >> Hi Andy, Ted,
>> > >>
>> > >> Thank you both for replying. Below I will describe the input data, the
>> > >> Pig script I am using, and the resulting output.
>> > >>
>> > >> -Input data is the following (in file 'vectorsPigStored.dat' ):
>> > >>
>> > >> bbb    (2,{(6595,4.0),(608,1.0)})
>> > >> ccd    (1,{(9763,1.0)})
>> > >> adc    (1,{(3670,1.0)})
>> > >> ads    (1,{(2297,1.0)})
>> > >>
>> > >>
>> > >> -The full Pig script I am running is as follows:
>> > >>
>> > >>
>> > >> REGISTER 'elephant-bird-core-3.0.7.jar'
>> > >> REGISTER 'elephant-bird-pig-3.0.7.jar'
>> > >> REGISTER 'elephant-bird-mahout-3.0.7.jar'
>> > >> REGISTER 'mahout-core-0.7.jar'
>> > >> REGISTER 'mahout-math-0.7.jar'
>> > >>
>> > >> pair = LOAD 'vectorsPigStored.dat' AS (key: chararray, val:
>> > >> (cardinality: int, entries: {entry: (index: int, value: double)}));
>> > >> --Store output
>> > >> store pair into 'output' using
>> > >> com.twitter.elephantbird.pig.store.SequenceFileStorage (
>> > >>   '-c com.twitter.elephantbird.pig.util.TextConverter',
>> > >>   '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>> > >> );
>> > >> --Store output without params for comparison
>> > >> store pair into 'outputRaw' using
>> > >> com.twitter.elephantbird.pig.store.SequenceFileStorage ();
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> -Here is the output that I see, printed line by line, followed by the
>> > >> key and value classes (using SequenceFile.Reader, reader.getKeyClass())
>> > >>
>> > >> -- from 'output'
>> > >> bbb  {}
>> > >> ccd  {}
>> > >> adc  {}
>> > >> ads  {}
>> > >> class org.apache.hadoop.io.Text    class org.apache.mahout.math.VectorWritable
>> > >>
>> > >> --from 'outputRaw'
>> > >> bbb  (2,{(6595,4.0),(608,1.0)})
>> > >> ccd  (1,{(9763,1.0)})
>> > >> adc  (1,{(3670,1.0)})
>> > >> ads  (1,{(2297,1.0)})
>> > >> class org.apache.hadoop.io.Text    class org.apache.hadoop.io.Text
>> > >>
>> > >>
>> > >>
>> > >> **Just to confirm that the issue wasn't with my use of chararray keys
>> > >> (instead of integer keys), I also tried a run using int keys, but
>> > >> the result is the same:
>> > >>
>> > >>
>> > >>
>> > >> --Output when using SequenceFileStorage with params  '-c
>> > >> com.twitter.elephantbird.pig.util.IntWritableConverter','-c
>> > >> com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>> > >>
>> > >> 1  {}
>> > >> 1  {}
>> > >> 1  {}
>> > >> 1  {}
>> > >> 1  {}
>> > >> class org.apache.hadoop.io.IntWritable    class
>> > >> org.apache.mahout.math.VectorWritable
>> > >>
>> > >> --Output from SequenceFileStorage without params
>> > >> 1  (2,{(6595,4.0),(608,1.0)})
>> > >> 1  (1,{(9763,1.0)})
>> > >> 1  (1,{(3670,1.0)})
>> > >> 1  (1,{(2297,1.0)})
>> > >> class org.apache.hadoop.io.Text    class org.apache.hadoop.io.Text
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> Any help greatly appreciated,
>> > >>
>> > >> Thanks again,
>> > >> Colum
>> > >>
>> > >>
>> > >>
>> > >> On Fri, Mar 1, 2013 at 6:45 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> > >>> Andy,
>> > >>>
>> > >>> Thanks for popping up!
>> > >>>
>> > >>> Elephant bird looks like it has awesome potential to make machine learning
>> > >>> with Hadoop vastly easier.  It is really good to see this kind of response
>> > >>> ... that is what turns potential into action.
>> > >>>
>> > >>> Thanks again.
>> > >>>
>> > >>> On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <
>>
>>
>>
>> --
>>
>>   -jake
>>
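
The fix discussed above (a shared cardinality, with every index inside
[0, cardinality)) can be checked before the data reaches Pig. The following
is a minimal Python sketch, offered only as an illustration: it parses rows
in the format of 'vectorsPigStored.dat', lists any indices outside the
declared range, and rewrites a row with a new cardinality. The function
names (parse_row, out_of_range, with_cardinality) and the regular
expressions are assumptions made for this example and are not part of
Elephant-Bird, Pig, or Mahout.

```python
import re

# Matches one "(index,value)" pair inside the bag, e.g. "(6595,4.0)".
ENTRY = re.compile(r"\((\d+),([-+0-9.eE]+)\)")
# Matches the whole vector tuple, e.g. "(2,{(6595,4.0),(608,1.0)})".
VECTOR = re.compile(r"^\((\d+),\{(.*)\}\)$")


def parse_row(line):
    """Split one whitespace-separated row into (key, cardinality, entries)."""
    key, vector_text = line.split(None, 1)
    match = VECTOR.match(vector_text.strip())
    if match is None:
        raise ValueError("unrecognised vector: %r" % vector_text)
    cardinality = int(match.group(1))
    entries = [(int(i), float(v)) for i, v in ENTRY.findall(match.group(2))]
    return key, cardinality, entries


def out_of_range(cardinality, entries):
    """Return the indices that fall outside [0, cardinality)."""
    return [i for i, _ in entries if not 0 <= i < cardinality]


def with_cardinality(line, cardinality):
    """Rewrite a row so that its vector declares the given cardinality."""
    key, _, entries = parse_row(line)
    bad = out_of_range(cardinality, entries)
    if bad:
        raise ValueError("indices %s are outside [0, %d)" % (bad, cardinality))
    bag = ",".join("(%d,%s)" % (i, v) for i, v in entries)
    return "%s\t(%d,{%s})" % (key, cardinality, bag)
```

For example, with_cardinality("bbb\t(2,{(6595,4.0),(608,1.0)})", 10000)
yields the first row of the corrected data (tab-separated), whereas passing
2 raises a ValueError because 6595 and 608 lie outside [0, 2).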
