mahout-user mailing list archives

From Timothy Potter <thelabd...@gmail.com>
Subject Question about storage in Pig-vector (Pig + Mahout)
Date Fri, 11 May 2012 18:38:58 GMT
I'm trying to run the simple 20-newsgroups example to train a Mahout
classifier using Pig, and I'm unsure about the elephant-bird part.

First, after struggling to get elephant-bird to build, storing to a
SequenceFile didn't work for me. Then I noticed PigModelStorage, used
that instead, and it works fine. Here is my script (comments removed
for brevity):

-- Train:

register '.../target/pig-vector-1.0-jar-with-dependencies.jar';

define train org.apache.mahout.pig.LogisticRegression('iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');

docs = load '20news-bydate-train/*/*' using
org.apache.mahout.pig.MessageLoader()
    as (newsgroup, id:int, subject, body);

define encodeVector org.apache.mahout.pig.encoders.EncodeVector('100000',
'subject+body', 'group:word, article:numeric, subject:text, body:text');
vectors = foreach docs generate newsgroup, encodeVector(*) as v;

grouped = group vectors all;

model = foreach grouped generate 1 as key, train(vectors) as model;

store model into 'pv-tmp/news_model2' using
org.apache.mahout.pig.PigModelStorage();


-- Eval:

define evaluate org.apache.mahout.pig.LogisticRegressionEval('sequence=pv-tmp/news_model2/part-r-00000, key=1');
test = load '20news-bydate-test/*/*' using
org.apache.mahout.pig.MessageLoader()
    as (newsgroup, id:int, subject, body);
testvecs = foreach test generate newsgroup, encodeVector(*) as v;
describe testvecs;
evalvecs = foreach testvecs generate evaluate(v);

dump evalvecs;

----

So my main question is: what does the elephant-bird model storage do
that PigModelStorage doesn't?

Cheers,
Tim
