mahout-user mailing list archives

From "Philippe Lamarche" <philippe.lamar...@gmail.com>
Subject Re: Problems with the Bayesian classifiers.
Date Tue, 29 Jul 2008 17:13:28 GMT
 Hi Robin,

I found out that I had a problem with my Mahout setup. While
reinstalling from svn and applying the patches, I noticed that
MAHOUT-9 adds files under "core/src" while MAHOUT-60 adds files under "src".

On my setup, the ant script "build.xml" lives in "core", so it
completely ignores anything added by MAHOUT-60: the target
"compile-examples" uses a source path of
"[workspace]/core/src/main/examples/bayes", while the files added by
MAHOUT-60 are in "[workspace]/src/main/examples/bayes".
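
For anyone who wants to check for the same mismatch, here is a minimal
sketch (in Python, not part of Mahout) that lists files present under
"[workspace]/src/main/examples" but absent from
"[workspace]/core/src/main/examples". The function name and the default
subdirectory are my own assumptions, based only on the two paths above.

```python
import os

def files_missing_from_core(workspace, subdir=os.path.join("src", "main", "examples")):
    """Return paths (relative to subdir) found under workspace/subdir
    that have no counterpart under workspace/core/subdir.

    Hypothetical helper: the layout is assumed from the two paths
    mentioned in this message, nothing more.
    """
    top = os.path.join(workspace, subdir)
    core = os.path.join(workspace, "core", subdir)
    missing = []
    for root, _dirs, names in os.walk(top):
        for name in names:
            # Path of this file relative to the top-level examples dir.
            rel = os.path.relpath(os.path.join(root, name), top)
            if not os.path.exists(os.path.join(core, rel)):
                missing.append(rel)
    return sorted(missing)
```

Any file it reports would have to be moved (or the build's source path
changed) before "compile-examples" can see it.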

After sorting this out, I was able to make CBayes work, with an
accuracy of over 90% on the split I provided earlier. This is
impressive! I wonder why I am getting a higher score than the one you
posted here.

However, I am still having trouble with the Enron corpus: everything
is predicted as one of the two classes with the highest weight
normalization, "1_1" and "1_4" (I might be totally wrong with this
assumption; "1_1" and "1_4" might have been selected by luck...).
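
One way to confirm that predictions collapse onto a couple of classes is
to tally the predicted labels. The sketch below is Python and
independent of Mahout; it assumes the (actual, predicted) pairs have
already been extracted from the classifier output by some other means,
since I am not assuming any particular log format here.

```python
from collections import Counter

def predicted_label_shares(pairs):
    """Map each predicted label to its share of all predictions.

    `pairs` is an iterable of (actual_label, predicted_label) tuples.
    Hypothetical helper; how the pairs are obtained is left open.
    """
    counts = Counter(predicted for _actual, predicted in pairs)
    total = sum(counts.values())
    # most_common() orders labels from most to least frequent.
    return {label: n / total for label, n in counts.most_common()}

# Made-up pairs for illustration only, not results from my tests.
example = [("1_1", "1_1"), ("1_2", "1_1"), ("1_3", "1_4"), ("1_4", "1_4")]
shares = predicted_label_shares(example)
```

If two labels account for nearly all of the shares while the actual
labels are spread out, the classifier is ignoring the other classes.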

Here is a link to a split I made out of the UC Berkeley annotated Enron
corpus. The emails are edited so that they don't contain the
header, which gave me a small accuracy gain while testing
with Mallet. The archive also includes logs from my tests on CBayes
with both splits.

http://www.2shared.com/file/3671623/69773258/TrainAndTesttar.html
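
For reference, header removal can be sketched as below. This is an
illustration in Python and not necessarily how the files in the archive
were produced; it assumes each raw message has the usual layout of
header lines, one blank line, then the body.

```python
def strip_headers(raw):
    """Return the body of a raw message, dropping the header block.

    Assumes headers end at the first blank line. If no blank line is
    found, the text is returned unchanged. Note that a message that has
    already lost its headers would lose its first paragraph instead.
    """
    text = raw.replace("\r\n", "\n")  # normalize line endings first
    _head, sep, body = text.partition("\n\n")
    return body if sep else text
```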

thanks,
Philippe.

On Sun, Jul 27, 2008 at 7:40 PM, Philippe Lamarche
<philippe.lamarche@gmail.com> wrote:
>  Hi,
>
> I am glad to see that you were able to get it working. I will
> try it as soon as possible. Probably something went wrong while
> downloading/applying/updating MAHOUT-60.
>
> I am using the UC Berkeley annotated subset that you can find in
> your link, here:
> http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz
> from here: http://bailando.sims.berkeley.edu/enron_email.html.
>
> The labels are multi-level; each message can have a:
> Coarse genre,
> Included/forwarded information,
> Primary topics,
> Emotional tone (if not neutral)
>
> There is a .cats file associated with each label.
>
> I made a little utility that lets you pick a label type, parses the .cats
> file, and outputs each message into the appropriately labeled folder. Also, it's
> easy to just use the 1 to 8 subfolders in the tar; these folders are
> labeled by coarse genre. I can share this little app if you want.
>
> I am very curious to see if I will be able to make it work.
>
> Thanks for the help,
> Philippe
>
>
> On Sun, Jul 27, 2008 at 11:29 AM, Robin Anil <robin.anil@gmail.com> wrote:
>> Also, could you tell me which version of the Enron email corpus you are using
>> for classification? Please provide the link. I found tons of variations
>> online. What classification labels are you using (Email User Name?).
>> http://sgi.nu/enron/corpora.php
>>
>
