mahout-user mailing list archives

From "Robin Anil" <robin.a...@gmail.com>
Subject Re: Problems with the Bayesian classifiers.
Date Tue, 29 Jul 2008 19:04:43 GMT
Hi, Philippe.
              Thanks for the dataset. The version you are using uses a dense
matrix formulation of CBayes. In order to load a huge dataset, I made
some tweaks that reduce the model size by a large margin, at the cost of
some additional overfitting. Please read the Mahout-dev list for more
information.

Robin

On Tue, Jul 29, 2008 at 10:43 PM, Philippe Lamarche <philippe.lamarche@gmail.com> wrote:

>  Hi Robin,
>
> I found out that I had a problem with my Mahout setup. While
> reinstalling from svn and applying the patches, I found out that
> MAHOUT-9 adds files to "core/src" while MAHOUT-60 adds files to "src".
>
> On my setup, the ant script "build.xml" is located in "core" and it
> completely ignores anything added by MAHOUT-60: the target
> "compile-examples" uses a source path of
> "[workspace]/core/src/main/examples/bayes" while the files added by
> MAHOUT-60 are in "[workspace]/src/main/examples/bayes".
>
> After sorting this out, I was able to make CBayes work, with an
> accuracy of over 90% on the split I provided earlier. This is
> impressive! I wonder why I am getting a higher score than the one you
> posted here.
>
> However, I am still having trouble with the Enron corpus: everything
> is predicted as one of the two classes with the highest weight
> normalization, "1_1" and "1_4" (I might be totally wrong with this
> assumption; "1_1" and "1_4" might be selected by luck...).
>
> Here is a link to a split I made out of the UC Berkeley annotated Enron
> corpus. The emails are edited so that they don't contain the
> header, which gave me a small accuracy gain while testing
> with Mallet. The archive also includes logs from my tests on CBayes
> with both splits.
>
> http://www.2shared.com/file/3671623/69773258/TrainAndTesttar.html
>
> thanks,
> Philippe.
>
> On Sun, Jul 27, 2008 at 7:40 PM, Philippe Lamarche
> <philippe.lamarche@gmail.com> wrote:
> >  Hi,
> >
> > I am glad to see that you were able to get it working. I will
> > try it as soon as possible. Probably something went wrong while
> > downloading/applying/updating MAHOUT-60.
> >
> > I am using the UC Berkeley annotated subset, which you can find via
> > your link, here:
> > http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz
> > (linked from http://bailando.sims.berkeley.edu/enron_email.html).
> >
> > It's a multi-level labeling scheme; each message can have a:
> > Coarse genre,
> > Included/forwarded information,
> > Primary topics,
> > Emotional tone (if not neutral)
> >
> > There is a .cats file associated with each label.
> >
> > I made a little utility that lets you pick a label type, parses the
> > .cats file, and outputs each message into the appropriate labeled
> > folder. Also, it's easy to just use the 1 to 8 subfolders in the tar;
> > these folders are labeled by coarse genre. I can share this little
> > app, if you want.
> >
> > I am very curious to see if I will be able to make it work.
> >
> > Thanks for the help,
> > Philippe
> >
> >
> > On Sun, Jul 27, 2008 at 11:29 AM, Robin Anil <robin.anil@gmail.com> wrote:
> >> Also, could you tell me which version of the Enron email corpus you
> >> are using for classification? Please provide the link. I found tons
> >> of variations online. What classification labels are you using
> >> (email user name?).
> >> http://sgi.nu/enron/corpora.php
> >>
> >
>
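
The label-sorting utility Philippe describes in the quoted message was not
posted to the list. The following is only a rough sketch of that idea in
Python, not his code. It rests on several assumptions that are not stated in
the thread: that each message "<id>.txt" has a sibling "<id>.cats" file, that
each .cats line has the form "n1,n2,frequency" (top-level category,
second-level category, number of times assigned), and that a label such as
"1_4" means n1=1, n2=4. The function names pick_label and sort_messages are
hypothetical.

```python
import shutil
from pathlib import Path


def pick_label(cats_text, top_level):
    """Return 'n1_n2' for the most frequent second-level category
    under the given top-level category, or None if there is none.
    Assumes each line looks like 'n1,n2,frequency'."""
    best = None  # (frequency, label)
    for line in cats_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            continue  # skip blank or malformed lines
        n1, n2, freq = (int(p) for p in parts)
        if n1 != top_level:
            continue
        # On ties, the first line seen wins.
        if best is None or freq > best[0]:
            best = (freq, f"{n1}_{n2}")
    return best[1] if best else None


def sort_messages(src_dir, dst_dir, top_level):
    """Copy each message into dst_dir/<label>/ according to its .cats file.
    Assumes '<id>.txt' sits next to '<id>.cats'. Returns the number copied."""
    src, dst = Path(src_dir), Path(dst_dir)
    copied = 0
    for cats_file in sorted(src.rglob("*.cats")):
        msg_file = cats_file.with_suffix(".txt")
        if not msg_file.exists():
            continue
        label = pick_label(cats_file.read_text(), top_level)
        if label is None:
            continue  # message has no label of the requested type
        target = dst / label
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy(msg_file, target / msg_file.name)
        copied += 1
    return copied
```

Under these assumptions, sort_messages("enron_with_categories", "by_genre", 1)
would produce one folder per coarse-genre label ("1_1", "1_4", ...), which is
a directory layout a folder-per-class trainer can read.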



-- 
Robin Anil
Senior Dual Degree Student
Department of Computer Science & Engineering
IIT Kharagpur

--------------------------------------------------------------------------------------------
techdigger.wordpress.com
A discursive take on the world around us

www.minekey.com
You Might Like This

www.ithink.com
Express Yourself
