mahout-user mailing list archives

From Timothy Potter
Subject Question about storage in Pig-vector (Pig + Mahout)
Date Fri, 11 May 2012 18:38:58 GMT
I'm trying to run the simple 20-newsgroups example to train a Mahout
classifier using Pig, and I'm unsure about the elephant-bird parts.

First, after a struggle to get elephant-bird to build, storing to a
SequenceFile didn't work for me. Then I noticed PigModelStorage, used
that instead, and it works just fine. Here is my script (comments removed
for brevity):

-- Train:

register '.../target/pig-vector-1.0-jar-with-dependencies.jar';

define train org.apache.mahout.pig.LogisticRegression('iterations=5,
inMemory=true, features=100000, categories=alt.atheism
comp.sys.mac.hardware sci.electronics talk.politics.guns
talk.politics.mideast talk.politics.misc sci.crypt
soc.religion.christian talk.religion.misc');

docs = load '20news-bydate-train/*/*' using
    as (newsgroup, id:int, subject, body);

define encodeVector org.apache.mahout.pig.encoders.EncodeVector('100000',
'subject+body', 'group:word, article:numeric, subject:text, body:text');
vectors = foreach docs generate newsgroup, encodeVector(*) as v;

grouped = group vectors all;

model = foreach grouped generate 1 as key, train(vectors) as model;

store model into 'pv-tmp/news_model2' using

-- Eval:

define evaluate
test = load '20news-bydate-test/*/*' using
    as (newsgroup, id:int, subject, body);
testvecs = foreach test generate newsgroup, encodeVector(*) as v;
describe testvecs;
evalvecs = foreach testvecs generate evaluate(v);

dump evalvecs;


So my main question is: what does the elephant-bird model storage do
that PigModelStorage doesn't?

