mahout-user mailing list archives

From Andrew Palumbo <ap....@outlook.com>
Subject RE: Insights to Naive Bayes classifier example - 20news groups
Date Tue, 02 Dec 2014 16:33:17 GMT


> Date: Tue, 2 Dec 2014 14:06:44 +0100
> Subject: Re: Insights to Naive Bayes classifier example - 20news groups
> From: stransky.ja@gmail.com
> To: user@mahout.apache.org
> 
> Hi Andrew,
> 
> many thanks for the final clarification! Now I have one last question, probably
> the most obvious one, but I must have missed it somewhere. All the
> examples end by testing the classifier and displaying the confusion matrix. So
> the state is:
> We have a trained and tested model and now we would like to use the model
> to classify unseen, unknown data, i.e. actually use the classifier. For sure
> it is clear how to prepare the input (vectorize etc.). What is not clear to
> me at the moment is how to call the trained model with new vectorized data as
> input. Or maybe even the vectorization itself, because we probably need
> the dictionary used by the model to produce valid vectors. And what
> about terms which were not in the training set, etc.?
> 
> Is there any documentation regarding this aspect?

As of Mahout 0.9 there are no CLI drivers available to vectorize and classify new documents.
There is a ticket open for Mahout 1.0 regarding this. Currently you'll have to write a utility
class to vectorize and classify new documents. As you mentioned, you'll need to use the same
dictionary.file-0 that seq2sparse created for training. Likewise, if you're using TF-IDF
weights, you'll need to use the same df-count file to compute the IDF. Both are located in
the directory output by seq2sparse. You'll also want to use the same maxNgramSize that you
used to train the model. If you want to keep it simple by using unigrams, you can avoid
Lucene integration and just keep a count of the occurrences of tokenized terms. Terms unseen by
the training set can be rejected.
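
A minimal sketch of the unigram approach described above, in plain Java with no Mahout or
Lucene dependency. The class name NewDocVectorizer, the in-memory maps standing in for the
contents of dictionary.file-0 and the df-count file, the lowercasing tokenizer, and the
textbook tf * ln(N / df) weighting are all assumptions made for illustration. The weights
seq2sparse computes may follow a different formula, so compare against your training vectors
before relying on it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: turn raw text into a sparse TF-IDF vector using a fixed training dictionary. */
final class NewDocVectorizer {

    /**
     * @param text       raw document text
     * @param dictionary term -> feature index (hypothetical stand-in for dictionary.file-0)
     * @param docFreq    feature index -> document frequency (hypothetical stand-in for df-count)
     * @param numDocs    number of documents in the training corpus
     * @return sparse vector as feature index -> weight
     */
    static Map<Integer, Double> vectorize(String text,
                                          Map<String, Integer> dictionary,
                                          Map<Integer, Long> docFreq,
                                          long numDocs) {
        // Count unigram occurrences, keeping only terms known to the training dictionary.
        Map<Integer, Integer> termCounts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            Integer index = dictionary.get(token);
            if (index != null) { // terms unseen in training are rejected
                termCounts.merge(index, 1, Integer::sum);
            }
        }
        // Assumed weighting: tf * ln(N / df). Not necessarily what seq2sparse uses.
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : termCounts.entrySet()) {
            long df = docFreq.getOrDefault(e.getKey(), 1L);
            vector.put(e.getKey(), e.getValue() * Math.log((double) numDocs / df));
        }
        return vector;
    }
}
```

The resulting index -> weight map would still have to be copied into a Mahout vector of the
same cardinality as the training vectors before it can be passed to the classifier.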

Once the document is vectorized, you can use BayesUtils.readModelFromDir(..) to retrieve your
model, BayesUtils.readLabelIndex(..) [1] to retrieve the label index, and
StandardNaiveBayesClassifier.classifyFull(...) [2] (or its ComplementaryNaiveBayesClassifier
counterpart) to classify your vector. You can also look at TestNaiveBayesDriver.analyzeResults [3]
to see how labels are assigned.
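
As a rough illustration of the last step, here is a self-contained sketch of picking a label
from per-label scores. It assumes the scores returned by classifyFull(...) have already been
copied into a double[] ordered by label index, and that the label with the highest score wins.
The class name LabelPicker and this array representation are hypothetical, and no Mahout
classes are used.

```java
import java.util.Map;

/** Sketch: choose the label whose score is highest. */
final class LabelPicker {

    /**
     * @param scores     scores[i] is the score for the label with index i
     * @param labelIndex label index -> label name (e.g. as read from the label index file)
     */
    static String bestLabel(double[] scores, Map<Integer, String> labelIndex) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("no scores to choose from");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return labelIndex.get(best);
    }
}
```

With Naive Bayes the scores are typically log-likelihood-like values, so they may all be
negative; only their relative order matters here.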

There's no documentation on the Mahout site at the moment. There is a good blog post here
that can give you an idea of how to get started:

https://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/BayesUtils.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/StandardNaiveBayesClassifier.java
[3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/test/TestNaiveBayesDriver.java

     
> 
> Thx
> Jakub
> 
> 
> 
> On 1 December 2014 at 21:12, Andrew Palumbo <ap.dev@outlook.com> wrote:
> 
> >
> >
> >
> > > However the sequence of steps as described in Mahout Cookbook seems to me
> > > incorrect as:
> >
> > This is entirely possible; that book may be out of date. The end-to-end
> > instructions on the website for the 20 newsgroups example are up to date
> > though, as is the example script.
> >
> > You don't want to merge all of the files into one directory, rather to
> > merge the training and testing sets in 20news-bydate while maintaining
> > their directory structure.
> >
> > > After data set download and extraction data are merged via command:
> > > *cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all*
> > >
> > > Which essentially copies files to a single location -> 20news-all folder
> >
> > this should not copy all of the *files* individually into the 20news-all
> > folder rather the directories containing the files:
> >
> >     $ ls 20news-all/
> >     alt.atheism               rec.autos           sci.space
> >     comp.graphics             rec.motorcycles     soc.religion.christian
> >     {...}
> >
> > > *./mahout seqdirectory  -i ${WORK_DIR}/20news-all  -o
> > > ${WORK_DIR}/20news-seq*
> > > Converts to a hadoop sequence directory from 20news-all dir - where all
> > > files were copied and effectively the classification into folders was lost.
> > > We can peek inside a created seq file via hadoop fs -text
> > > $WORK_DIR/20news-seq/chunk-0 | more which prints following result:
> > >
> > > */67399* From:xxx
> > > Subject: Re: Imake-TeX: looking for beta testers
> > > Organization: CS Department, Dortmund University, Germany
> > > Lines: 59
> > > Distribution: world
> > > NNTP-Posting-Host: tommy.informatik.uni-dortmund.de
> > > In article <xxxxx>,
> > > yyy writes:
> > > |> As I announced at the X Technical Conference in January, I would like to
> > > |> make Imake-TeX, the Imake support for using the TeX typesetting system,
> > > |> publically available. Currently Imake-TeX is in beta test here at the
> > > |> computer science department of Dortmund University, and I am looking
> > > ...
> > >
> > > To my understanding - number after slash in bold represents a key of
> > > sequence file, right?
> >
> > Correct, though it should read something like:
> >
> >     /comp.graphics/67399 {...}
> >
> > where comp.graphics is the category as well as the directory that it was
> > read in from.
> >
> > > Then seq2sparse is performed:
> > >
> > > ./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors
> > > -lnorm -nv -wt tfidf
> > >
> > >
> > > *Conclusions which I would like to verify:*
> > > - sequence of steps as described is incorrect - particularly conversion to
> > > sequence file as the key doesn't contain folder name describing the
> > > category of training data, or am I still missing something in here?
> >
> > yes- it looks like you are copying the individual files rather than the
> > directories into 20news-all
> >
> > >
> > > - mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o
> > > ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow
> > >   What are the exact mechanics when label extraction is performed e.g.
> > > /category/docID as a key is resolved just to category ???
> >
> > yes
> >
> > > Is the last part after the slash dropped every time to obtain the
> > > category? Or is it possible to define the strategy somewhere?
> >
> > The hard-coded convention as of Mahout 0.9 is to extract the label as the
> > first string after the key is split on "/".  This makes category
> > organization by directory and sequence file conversion with seqdirectory
> > straightforward.  The new scala DSL Naive Bayes which is currently in
> > development will allow the user more flexibility in extracting the label.
> >
> > The label extraction process can be found here:
> >
> > https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/training/IndexInstancesMapper.java
> >
> > and could be modified if need be.
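
A small self-contained sketch of the key convention quoted above: split the key on "/" and
take the element at index 1, since the leading slash produces an empty element at index 0.
The class name KeyLabel is hypothetical and this is not the code from
IndexInstancesMapper, only an illustration of the described behaviour.

```java
/** Sketch: extract the category from a sequence-file key such as /comp.graphics/67399. */
final class KeyLabel {

    static String labelOf(String key) {
        // "/comp.graphics/67399" splits into ["", "comp.graphics", "67399"].
        String[] parts = key.split("/");
        if (parts.length < 2) {
            throw new IllegalArgumentException("key has no '/'-separated label: " + key);
        }
        return parts[1];
    }
}
```

Under this convention a key of the form /67399, which is what results when the files are
copied without their category directories, yields the document id as the "label", so the
category information is lost.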
> >
> > >
> > > Thanks
> > > Jakub
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On 1 December 2014 at 17:43, Andrew Palumbo <ap.dev@outlook.com> wrote:
> > >
> > > > Hi Jakub,
> > > >
> > > > The step that you are missing is `$mahout seqdir ...`. In this step each
> > > > file in each directory (where the directory is the Category) is converted
> > > > into a sequence file of form <Text,Text> where the Text key is
> > > > /Category/doc_id.
> > > >
> > > > `$mahout seq2sparse ...` vectorizes the output of `$mahout seqdir ...`
> > > > into a sequence file of form <Text, VectorWritable> leaving the Keys
> > > > unchanged.
> > > >
> > > > `$mahout trainnb ... -el ...` then extracts the label from the Keys of the
> > > > training data, i.e. the "Category" from /Category/doc_id.
> > > >
> > > > please see
> > > > http://mahout.apache.org/users/classification/twenty-newsgroups.html
> > > > and http://mahout.apache.org/users/classification/bayesian.html
> > > > for more information.
> > > >
> > > > > Date: Mon, 1 Dec 2014 17:09:55 +0100
> > > > > Subject: Insights to Naive Bayes classifier example - 20news groups
> > > > > From: stransky.ja@gmail.com
> > > > > To: user@mahout.apache.org
> > > > >
> > > > > Hello Mahout experts,
> > > > >
> > > > > I am trying to follow some examples provided with Mahout and some features
> > > > > are not clear to me. It would be great if someone could clarify a bit more.
> > > > >
> > > > > To prepare the data (train and test) the following sequence of steps is
> > > > > performed (taken from mahout cookbook):
> > > > >
> > > > > All input is merged into single dir:
> > > > > *cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all*
> > > > >
> > > > > Converted to hadoop sequence file and then vectorized:
> > > > > *./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors
> > > > > -lnorm -nv -wt tfidf*
> > > > >
> > > > > Divided into test and train data:
> > > > > *./mahout split*
> > > > > *-i ${WORK_DIR}/20news-vectors/tfidf-vectors*
> > > > > *--trainingOutput ${WORK_DIR}/20news-train-vectors*
> > > > > *--testOutput ${WORK_DIR}/20news-test-vectors*
> > > > > *--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential*
> > > > >
> > > > > Model is trained:
> > > > > *./mahout trainnb*
> > > > > *-i ${WORK_DIR}/20news-train-vectors -el*
> > > > > *-o ${WORK_DIR}/model*
> > > > > *-li ${WORK_DIR}/labelindex*
> > > > > *-ow*
> > > > >
> > > > >
> > > > > What I am missing here, and what is the subject of my question, is: where is
> > > > > the category assigned to the testing data to train the categorization? What I
> > > > > would expect is that there will be a vector which says that this document
> > > > > belongs to a particular category. This seems to me to have been erased by
> > > > > the first step, where we mixed all the data to create our corpus. I would still
> > > > > expect that this information will be retained somewhere. Instead the
> > > > > messages look as follows:
> > > > >
> > > > > From: yeoy@a.cs.okstate.edu (YEO YEK CHONG)
> > > > > Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
> > > > > Organization: Oklahoma State University
> > > > > Lines: 7
> > > > >
> > > > > From article <a4Fm3B1w165w@vicuna.ocunix.on.ca>, by Steve Frampton
> > > > > <frampton@vicuna.ocunix.on.ca>:
> > > > > > I was wondering, is the "Kermit" package (the actual package, not a
> > > > >
> > > > > Yes!  In the usual ftp sites.
> > > > >
> > > > > Yek CHong
> > > > >
> > > > >
> > > > > There is no notion of which group this text belongs to. What's the hack!
> > > > >
> > > > > Could someone please clarify a bit what's going on, as when cross-validation
> > > > > is performed the confusion matrix takes those categories into consideration.
> > > > >
> > > > > Thanks a lot for helping me out
> > > > > Jakub
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Jakub Stransky
> > > cz.linkedin.com/in/jakubstransky
> >
> >
> 
> 
> 
> -- 
> Jakub Stransky
> cz.linkedin.com/in/jakubstransky
 		 	   		  