mahout-user mailing list archives

From Jakub Stransky <stransky...@gmail.com>
Subject Re: Insights to Naive Bayes classifier example - 20news groups
Date Mon, 01 Dec 2014 19:05:22 GMT
Hi Andrew,

thanks for your response, which points me to the missing piece of the
puzzle! However, something is still not clear to me. Either the sequence
of commands is not correct, or I haven't fully grasped the elementary
mechanics here. I understand seqdirectory and seq2sparse as described here:
http://mahout.apache.org/users/basics/creating-vectors-from-text.html

However, the sequence of steps described in the Mahout Cookbook seems
incorrect to me:

After the data set is downloaded and extracted, the data are merged via the
command:
cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all

This essentially copies the files to a single location, the 20news-all folder.
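
Whether the category folders survive this merge depends on how deep the glob reaches. The following is a minimal sketch on a toy directory tree, not the real data set: the names `bydate`, `train`, `test`, `all-keep`, `all-flat` and the document IDs are hypothetical, chosen only to mimic a split/category/document layout.

```shell
#!/bin/sh
# Toy layout (hypothetical names): <split>/<category>/<document>
demo=$(mktemp -d)
mkdir -p "$demo/bydate/train/sci.space" "$demo/bydate/test/rec.autos"
echo "doc a" > "$demo/bydate/train/sci.space/1001"
echo "doc b" > "$demo/bydate/test/rec.autos/2002"

# Two glob levels: the last * matches the category directories,
# so each category is copied as a whole directory.
mkdir "$demo/all-keep"
cp -R "$demo"/bydate/*/* "$demo/all-keep"

# Three glob levels: the last * matches the documents themselves,
# so the category directories are not recreated in the target.
mkdir "$demo/all-flat"
cp -R "$demo"/bydate/*/*/* "$demo/all-flat"

echo "two levels:";   (cd "$demo/all-keep" && find . -type f | sort)
echo "three levels:"; (cd "$demo/all-flat" && find . -type f | sort)
```

If the real extraction produces one directory level less than the command expects, the same command would behave like the three-level case here. The temporary tree can be removed afterwards with rm -rf "$demo".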

./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq

This converts the 20news-all dir into a Hadoop sequence file directory. All
files were copied into that dir, so the classification into folders was
effectively lost. We can peek inside a created seq file via
hadoop fs -text $WORK_DIR/20news-seq/chunk-0 | more
which prints the following result:

*/67399* From:xxx
Subject: Re: Imake-TeX: looking for beta testers
Organization: CS Department, Dortmund University, Germany
Lines: 59
Distribution: world
NNTP-Posting-Host: tommy.informatik.uni-dortmund.de
In article <xxxxx>,
yyy writes:
|> As I announced at the X Technical Conference in January, I would like to
|> make Imake-TeX, the Imake support for using the TeX typesetting system,
|> publically available. Currently Imake-TeX is in beta test here at the
|> computer science department of Dortmund University, and I am looking
...

As I understand it, the number after the slash (shown in bold) is the key
of the sequence file record, right?
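
One way to check this is to split a record line into key and value and see whether the key carries a category component. The sketch below works on a hypothetical sample line and assumes that hadoop fs -text prints the key, a tab, and then the value; the variable names are illustrative only.

```shell
#!/bin/sh
# Hypothetical first line of a record: key, tab, start of the value.
tab=$(printf '\t')
record="/67399${tab}From:xxx"

key=${record%%"$tab"*}     # everything before the first tab
value=${record#*"$tab"}    # everything after the first tab

# A key of the form /Category/doc_id still contains a slash once the
# leading slash is removed; a key like /67399 does not.
case ${key#/} in
  */*) has_category=yes ;;
  *)   has_category=no ;;
esac

echo "key=$key has_category=$has_category"
```

For the sample record above the key is /67399, which has no category component.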

Then seq2sparse is performed:

./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf


*Conclusions which I would like to verify:*
- The sequence of steps as described is incorrect, particularly the
conversion to a sequence file, as the key doesn't contain the folder name
describing the category of the training data. Or am I still missing
something here?

- mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow
  What are the exact mechanics of label extraction? For example, is a key
of the form /category/docID resolved just to category? Is the last part
after the slash always dropped, with the rest taken as the category? Or is
it possible to define the strategy somewhere?
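
To make the question concrete, here is a minimal sketch of one plausible rule, namely "take the first path component of the key as the label". It only mimics the behaviour Andrew describes for /Category/doc_id keys; it is not taken from Mahout's code, and the function name label_from_key is hypothetical.

```shell
#!/bin/sh
# label_from_key KEY: print the first path component of a key
# such as /Category/doc_id (assumed rule, not Mahout's actual code).
label_from_key() {
  key=${1#/}                    # drop the leading slash
  printf '%s\n' "${key%%/*}"    # keep everything before the next slash
}

label_from_key "/sci.space/1001"   # prints sci.space
label_from_key "/67399"            # prints 67399
```

Under this assumed rule, a key without a category component, such as /67399, would yield the document ID itself as the label, which is why the shape of the keys matters.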

Thanks
Jakub

On 1 December 2014 at 17:43, Andrew Palumbo <ap.dev@outlook.com> wrote:

> Hi Jakub,
>
> The step that you are missing is `$mahout seqdir ...`. In this step each
> file in each directory (where the directory is the Category) is converted
> into a sequence file of form <Text,Text> where the Text key is
> /Category/doc_id.
>
> `$mahout seq2sparse ...` vectorizes the output of `$mahout seqdir ...`
> into a sequence file of form <Text, VectorWritable> leaving the Keys
> unchanged.
>
> `$mahout trainnb ... -el ...` then extracts the label from the Keys of the
> training data, i.e. the "Category" from /Category/doc_id.
>
> Please see
> http://mahout.apache.org/users/classification/twenty-newsgroups.html
> and http://mahout.apache.org/users/classification/bayesian.html
> for more information.
>
> > Date: Mon, 1 Dec 2014 17:09:55 +0100
> > Subject: Insights to Naive Bayes classifier example - 20news groups
> > From: stransky.ja@gmail.com
> > To: user@mahout.apache.org
> >
> > Hello Mahout experts,
> >
> > I am trying to follow some examples provided with Mahout and some features
> > are not clear to me. It would be great if someone could clarify a bit more.
> >
> > To prepare the data (train and test) the following sequence of steps is
> > performed (taken from the Mahout Cookbook):
> >
> > All input is merged into a single dir:
> > cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all
> >
> > Converted to a Hadoop sequence file and then vectorized:
> > ./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
> >
> > Divided into test and train data:
> > ./mahout split
> >   -i ${WORK_DIR}/20news-vectors/tfidf-vectors
> >   --trainingOutput ${WORK_DIR}/20news-train-vectors
> >   --testOutput ${WORK_DIR}/20news-test-vectors
> >   --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
> >
> > Model is trained:
> > ./mahout trainnb
> >   -i ${WORK_DIR}/20news-train-vectors -el
> >   -o ${WORK_DIR}/model
> >   -li ${WORK_DIR}/labelindex
> >   -ow
> >
> >
> > What I am missing here, and the subject of my question, is: where is the
> > category assigned to the data used to train the categorization? What I
> > would expect is a vector which says that this document belongs to a
> > particular category. It seems to me that this has been erased by the
> > first step, where we mixed all the data to create our corpus. I would still
> > expect this information to be retained somewhere. Instead the
> > messages look as follows:
> >
> > From: yeoy@a.cs.okstate.edu (YEO YEK CHONG)
> > Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
> > Organization: Oklahoma State University
> > Lines: 7
> >
> > From article <a4Fm3B1w165w@vicuna.ocunix.on.ca>, by Steve Frampton <
> > frampton@vicuna.ocunix.on.ca>:
> > > I was wondering, is the "Kermit" package (the actual package, not a
> >
> > Yes!  In the usual ftp sites.
> >
> > Yek CHong
> >
> >
> > There is no indication of which group this text belongs to. What the heck!
> >
> > Could someone please clarify a bit what's going on, as when cross-validation
> > is performed the confusion matrix takes those categories into consideration.
> >
> > Thanks a lot for helping me out
> > Jakub
>
>



-- 
Jakub Stransky
cz.linkedin.com/in/jakubstransky
