mahout-user mailing list archives

From Jakub Stransky <stransky...@gmail.com>
Subject Insights to Naive Bayes classifier example - 20news groups
Date Mon, 01 Dec 2014 16:09:55 GMT
Hello Mahout experts,

I am trying to follow some examples provided with Mahout and some features
are not clear to me. It would be great if someone could clarify a bit more.

To prepare the data (train and test), the following sequence of steps is
performed (taken from the Mahout cookbook):

All input is merged into a single directory:
  cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all

Converted to a Hadoop sequence file and then vectorized:
  ./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors \
    -lnorm -nv -wt tfidf

Divided into test and train data:
  ./mahout split \
    -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
    --trainingOutput ${WORK_DIR}/20news-train-vectors \
    --testOutput ${WORK_DIR}/20news-test-vectors \
    --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

Model is trained:
  ./mahout trainnb \
    -i ${WORK_DIR}/20news-train-vectors -el \
    -o ${WORK_DIR}/model \
    -li ${WORK_DIR}/labelindex \
    -ow


What I am missing here, and what is the subject of my question, is: where is
the category assigned to the data used to train and test the classifier? I
would expect a vector (or some other record) saying that a given document
belongs to a particular category. It seems to me that this information was
erased by the first step, where we mixed all the data together to create our
corpus. I would still expect it to be retained somewhere. Instead, the
messages look as follows:

From: yeoy@a.cs.okstate.edu (YEO YEK CHONG)
Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
Organization: Oklahoma State University
Lines: 7

>From article <a4Fm3B1w165w@vicuna.ocunix.on.ca>, by Steve Frampton <
frampton@vicuna.ocunix.on.ca>:
> I was wondering, is the "Kermit" package (the actual package, not a

Yes!  In the usual ftp sites.

Yek CHong


There is no indication of which group this text belongs to. What the heck!
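
To make the expectation concrete, here is a minimal Python sketch (not Mahout
code) of the kind of label bookkeeping I would expect to find somewhere in the
pipeline. It assumes, purely for illustration, that each document is identified
by a key of the form /<newsgroup>/<file name>; the example keys and the helper
label_from_key are hypothetical, and whether Mahout does anything like this is
exactly what I am asking.

```python
from pathlib import PurePosixPath


def label_from_key(key: str) -> str:
    """Return the parent directory name of a key such as '/sci.space/60151'."""
    return PurePosixPath(key).parent.name


# Hypothetical document keys (assumption: the newsgroup survives as a path part).
keys = [
    "/comp.os.ms-windows.misc/9141",
    "/sci.space/60151",
    "/sci.space/60152",
]

# One label per document, plus an index mapping each label to an integer id.
labels = [label_from_key(k) for k in keys]
label_index = {name: i for i, name in enumerate(sorted(set(labels)))}

for key, label in zip(keys, labels):
    print(key, "->", label, "(id", str(label_index[label]) + ")")
```

If the category really is carried alongside each document in some such way, the
message body itself would not need to mention the group at all.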

Could someone please clarify what is going on? When cross-validation is
performed, the confusion matrix does take those categories into consideration.

Thanks a lot for helping me out
Jakub
