mahout-user mailing list archives

From Kevin Moulart <kevinmoul...@gmail.com>
Subject Re: Use Naïve Bayes on a large CSV
Date Tue, 25 Feb 2014 15:25:28 GMT
I finally managed to make it run. I had to put a / in the class label in the
input file, so I used Yes/1 or No/0 instead of just 1 or 0.
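
A possible explanation for why the / is needed, offered as an assumption rather than a confirmed reading of the Mahout source: trainnb appears to derive the label by splitting the SequenceFile key on / and taking the element at index 1. That would account both for the out-of-bounds exception at index 1 reported earlier in the thread for a bare key such as 1, and for Yes/1 working. The sketch below reproduces that behaviour in plain Java; LabelKeyDemo and labelFromKey are hypothetical names, not Mahout API.

```java
public class LabelKeyDemo {

    // Assumption: mimics how trainnb is believed to extract the label,
    // i.e. split the key on "/" and take the second element.
    static String labelFromKey(String key) {
        return key.split("/")[1];
    }

    public static void main(String[] args) {
        // A key with a "/" yields the part after the first slash.
        System.out.println(labelFromKey("Yes/1"));
        // A key of the form /label/id also yields the label.
        System.out.println(labelFromKey("/0/row7"));
        // A bare label has no element at index 1, so the lookup fails.
        try {
            labelFromKey("1");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Key without '/': " + e);
        }
    }
}
```

If this assumption holds, the label that trainnb actually learns from Yes/1 is 1, and the text before the slash is ignored.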

But then, when testing the model, I noticed that it doesn't classify all of
the data:
14/02/25 16:16:30 INFO mapred.JobClient:   Map-Reduce Framework
14/02/25 16:16:30 INFO mapred.JobClient:     Map input records=*300000*
14/02/25 16:16:30 INFO mapred.JobClient:     Map output records=300000
14/02/25 16:16:30 INFO mapred.JobClient:     Input split bytes=476
14/02/25 16:16:30 INFO mapred.JobClient:     Spilled Records=0
14/02/25 16:16:30 INFO mapred.JobClient:     CPU time spent (ms)=32000
14/02/25 16:16:30 INFO mapred.JobClient:     Physical memory (bytes) snapshot=834502656
14/02/25 16:16:30 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3738030080
14/02/25 16:16:30 INFO mapred.JobClient:     Total committed heap usage (bytes)=918552576
14/02/25 16:16:31 INFO test.TestNaiveBayesDriver: Standard NB Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :      36078   91.3552%
Incorrectly Classified Instances        :       3414    8.6448%
Total Classified Instances              :      *39492*

=======================================================
Confusion Matrix
-------------------------------------------------------
a      b      <--Classified as
34445  2114   |  36559  a = 0
1300   1633   |  2933   b = 1


I ran testnb on the exact same file I used to train the model, so I expected
all 300000 records to be classified.

Any idea?
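
The summary figures can be cross-checked against the confusion matrix with a few lines of arithmetic. The sketch below only re-adds the numbers printed in the log above; the class name ConfusionMatrixCheck and its helper methods are hypothetical, and it does not explain why the remaining records were skipped.

```java
public class ConfusionMatrixCheck {

    // Sum of every cell: the number of instances that were classified.
    static long total(long[][] m) {
        long sum = 0;
        for (long[] row : m) {
            for (long cell : row) {
                sum += cell;
            }
        }
        return sum;
    }

    // Sum of the diagonal: the correctly classified instances.
    static long correct(long[][] m) {
        long sum = 0;
        for (int i = 0; i < m.length; i++) {
            sum += m[i][i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Rows are the actual classes (0, 1), columns the predicted classes.
        long[][] matrix = {{34445, 2114}, {1300, 1633}};
        long mapInputRecords = 300000; // "Map input records" from the job log

        long total = total(matrix);
        long correct = correct(matrix);
        System.out.println("Classified:   " + total);
        System.out.println("Correct:      " + correct);
        System.out.printf("Accuracy:     %.4f%%%n", 100.0 * correct / total);
        System.out.println("Unclassified: " + (mapInputRecords - total));
    }
}
```

The matrix is internally consistent with the summary, so the gap lies between the 300000 map input records and the 39492 instances that reach the result analysis.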


2014-02-25 11:33 GMT+01:00 Kevin Moulart <kevinmoulart@gmail.com>:

> All right, I've managed to narrow it down to the label index. I looked at
> the code, but it isn't really clear to me. What exactly should I provide
> as the label index?
>
> As a reminder, lines of my original file look like this:
> 0, 0.3222, 0, 1.543, ...
> 1, 0, 1.42, 1.12, ...
>
> The leading 0 or 1 is the label I'm trying to learn, and the rest is the
> data.
>
> For now I have the previously mentioned Java code that creates the
> SequenceFile from my CSV, but when I then run trainnb on it, it tries to
> create a label index and fails with an ArrayIndexOutOfBoundsException: 1.
>
> Could someone tell me how to create the index, even manually at this point?
>
> Thanks in advance !
>
>
> 2014-02-24 15:41 GMT+01:00 Kevin Moulart <kevinmoulart@gmail.com>:
>
>> I'll do that as soon as I manage to make it work ^^'. That's a great idea!
>>
>> I'm stuck with this for now:
>>
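>> import java.io.BufferedReader;
>> import java.io.FileReader;
>> import java.io.IOException;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.SequenceFile;
>> import org.apache.hadoop.io.Text;
>> import org.apache.mahout.math.RandomAccessSparseVector;
>> import org.apache.mahout.math.Vector;
>> import org.apache.mahout.math.VectorWritable;
>>
>> // args[1]: input CSV, args[2]: output path, args[3]: separator regex
>> public static void main(String[] args) throws IOException {
>>   Configuration conf = new Configuration(true);
>>   FileSystem fs = FileSystem.get(conf);
>>   Path filePath = new Path(args[2]);
>>   // Start from a clean output file.
>>   if (fs.exists(filePath)) {
>>     fs.delete(filePath, true);
>>   }
>>   BufferedReader reader = new BufferedReader(new FileReader(args[1]));
>>   // Key: class label as Text; value: features as VectorWritable.
>>   SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
>>       filePath, Text.class, VectorWritable.class);
>>   try {
>>     String line;
>>     while ((line = reader.readLine()) != null) {
>>       String[] c = line.split(args[3]);
>>       if (c.length > 1) {
>>         // c[0] is the label; the features start at c[1], so the
>>         // feature array has one element fewer than the line.
>>         double[] d = new double[c.length - 1];
>>         for (int i = 1; i < c.length; i++) {
>>           d[i - 1] = Double.parseDouble(c[i]);
>>         }
>>         Vector vec = new RandomAccessSparseVector(d.length);
>>         vec.assign(d);
>>         writer.append(new Text(c[0]), new VectorWritable(vec));
>>       }
>>     }
>>   } finally {
>>     // Close both resources even if a line fails to parse.
>>     writer.close();
>>     reader.close();
>>   }
>> }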
>>
>>
>> Which produces a sequence file but Mahout's trainnb doesn't seem to like
>> it that much, so I'm working on it for the moment.
>>
>>
>> 2014-02-24 15:37 GMT+01:00 Ted Dunning <ted.dunning@gmail.com>:
>>
>>> Kevin,
>>>
>>> While this is fresh in your mind, can you prepare a javadoc patch that
>>> would have helped you out? And suggest other doc patches as well?
>>>
>>>
>>>
>>> On Mon, Feb 24, 2014 at 3:00 AM, Kevin Moulart <kevinmoulart@gmail.com> wrote:
>>>
>>> > Thanks, that's about the clearest answer I got so far :)
>>> >
>>> >
>>> > 2014-02-24 11:59 GMT+01:00 Sebastian Schelter <ssc@apache.org>:
>>> >
>>> > > NaiveBayes expects a SequenceFile as input. The key is the class label
>>> > > as Text; the value is the feature vector as VectorWritable.
>>> > >
>>> > > --sebastian
>>> > >
>>> > >
>>> > > On 02/24/2014 11:51 AM, Kevin Moulart wrote:
>>> > >
>>> > >> Hi again,
>>> > >> I finally decided to go through Java to make a sequence file for
>>> > >> Naive Bayes, but I still can't find any place stating exactly what
>>> > >> should be in the sequence file for Mahout to process it with Naive
>>> > >> Bayes.
>>> > >>
>>> > >> I tried virtually every piece of code I found related to this
>>> > >> subject, with no luck.
>>> > >>
>>> > >> My CSV file looks like this:
>>> > >> Label that I want to predict, feature 1, feature 2, ..., feature 1628
>>> > >>
>>> > >> Could someone tell me exactly what the Naive Bayes training procedure
>>> > >> expects?
>>> > >>
>>> > >>
>>> > >> 2014-02-20 13:56 GMT+01:00 Jay Vyas <jayunit100@gmail.com>:
>>> > >>
>>> > >>  This relates to a previous question I have:  Does mahout have
a
>>> concept
>>> > >>> of
>>> > >>> adapters which allow us to read data csv style data with filters
to
>>> > >>> create
>>> > >>> exact format  for its various inputs (i.e. Recommender three
column
>>> > >>> format).?  If not is it worth a jira?
>>> > >>>
>>> > >>>
>>> > >>> On Feb 20, 2014, at 7:50 AM, Kevin Moulart <kevinmoulart@gmail.com> wrote:
>>> > >>>
>>> > >>>> Hi and thanks!
>>> > >>>>
>>> > >>>> What about the command line? Is there a way to do that using the
>>> > >>>> existing command line?
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> 2014-02-20 12:02 GMT+01:00 Suneel Marthi <suneel_marthi@yahoo.com>:
>>> > >>>>
>>> > >>>>> To convert input CSV to vectors, you can either:
>>> > >>>>>
>>> > >>>>> a) Use CSVIterator
>>> > >>>>> b) Use InputDriver
>>> > >>>>>
>>> > >>>>> Either of the above should generate vectors from the input CSV that
>>> > >>>>> could then be fed into Mahout classifier/clustering jobs.
>>> > >>>>>
>>> > >>>>>
>>> > >>>>>
>>> > >>>>>
>>> > >>>>>
>>> > >>>>> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <kevinmoulart@gmail.com> wrote:
>>> > >>>>>
>>> > >>>>> Hi, I'm trying to apply a Naive Bayes classifier to a large CSV
>>> > >>>>> file from the command line.
>>> > >>>>>
>>> > >>>>> I know I have to feed the classifier with a seq file, so I tried to
>>> > >>>>> put my CSV into one using the seqdirectory command, but even when I
>>> > >>>>> try with a really small CSV (less than 100 MB) I instantly get an
>>> > >>>>> OutOfMemoryError (Java heap space):
>>> > >>>>>
>>> > >>>>> mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
>>> > >>>>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>> > >>>>>> Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
>>> > >>>>>> MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
>>> > >>>>>> 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/cacf/Echant/testSeq], --keyPrefix=[], --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
>>> > >>>>>> 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
>>> > >>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>> > >>>>>>     at java.util.Arrays.copyOf(Arrays.java:2367)
>>> > >>>>>>     at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
>>> > >>>>>>     at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
>>> > >>>>>>     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
>>> > >>>>>>     at java.lang.StringBuilder.append(StringBuilder.java:132)
>>> > >>>>>>     at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
>>> > >>>>>>     at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
>>> > >>>>>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
>>> > >>>>>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
>>> > >>>>>>     at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
>>> > >>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>> > >>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>> > >>>>>>     at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
>>> > >>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> > >>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> > >>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> > >>>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>> > >>>>>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
>>> > >>>>>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
>>> > >>>>>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
>>> > >>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> > >>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> > >>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> > >>>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>> > >>>>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>> > >>>>>>
>>> > >>>>>
>>> > >>>>>
>>> > >>>>> Do you have an idea, or a simple way to use Naive Bayes against my
>>> > >>>>> large CSV?
>>> > >>>>>
>>> > >>>>> Thanks in advance!
>>> > >>>>> --
>>> > >>>>> Kévin Moulart
>>> > >>>>> GSM France : +33 7 81 06 10 10
>>> > >>>>> GSM Belgique : +32 473 85 23 85
>>> > >>>>> Téléphone fixe : +32 2 771 88 45
>>> > >>>>>

-- 
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45
