mahout-user mailing list archives

From Kevin Moulart <kevinmoul...@gmail.com>
Subject Re: Use Naïve Bayes on a large CSV
Date Mon, 24 Feb 2014 14:41:52 GMT
I'll do that as soon as I manage to make it work ^^' That's a great idea!

I'm stuck with this for now:

public static void main(String[] args) throws IOException,
        InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration(true);
    FileSystem fs = FileSystem.get(conf);
    BufferedReader reader = new BufferedReader(new FileReader(args[1]));
    Path filePath = new Path(args[2]);
    if (fs.exists(filePath))
        fs.delete(filePath, true);
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
            filePath, Text.class, VectorWritable.class);
    try {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] c = line.split(args[3]);
            if (c.length > 1) {
                double[] d = new double[c.length];
                for (int i = 1; i < c.length; i++)
                    d[i] = Double.parseDouble(c[i]);
                Vector vec = new RandomAccessSparseVector(c.length);
                vec.assign(d);
                VectorWritable writable = new VectorWritable();
                writable.set(vec);
                writer.append(new Text(c[0]), writable);
            }
        }
        writer.close();
    } catch (Throwable t) {
        t.printStackTrace();
    }
    reader.close();
}


This produces a sequence file, but Mahout's trainnb doesn't seem to like it
much, so I'm still working on it for the moment.
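
Two details in the snippet above may be worth checking; neither is confirmed
in this thread. First, d[0] is never filled, so the label column ends up as an
always-zero first feature in the vector. Second, some Mahout releases appear to
read the label out of a key shaped like /label/id rather than a bare label;
that key layout is an assumption here and should be verified against the Mahout
version in use. A minimal, Hadoop-free sketch of the per-line parsing with both
adjustments (the class and method names are hypothetical, not Mahout API) could
look like this:

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout: turns one split CSV line into a
// SequenceFile key string and a feature array.
class CsvRowSketch {

    // Features are columns 1..n-1, shifted down so the first feature
    // lands at index 0 instead of leaving a zero in the label's slot.
    static double[] features(String[] cols) {
        double[] d = new double[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {
            d[i - 1] = Double.parseDouble(cols[i].trim());
        }
        return d;
    }

    // ASSUMPTION: key layout "/label/rowId"; verify against your Mahout version.
    static String key(String[] cols, long rowId) {
        return "/" + cols[0].trim() + "/" + rowId;
    }

    public static void main(String[] args) {
        String[] cols = "spam,1.5,0,2".split(",");
        System.out.println(key(cols, 7L));
        System.out.println(Arrays.toString(features(cols)));
    }
}
```

In the original loop, the key string would be wrapped in new Text(...) and the
array assigned to a vector created with cardinality c.length - 1.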


2014-02-24 15:37 GMT+01:00 Ted Dunning <ted.dunning@gmail.com>:

> Kevin,
>
> While this is fresh in your mind can you prepare a javadoc patch that would
> have helped you out?  And suggest other doc patches as well?
>
>
>
> On Mon, Feb 24, 2014 at 3:00 AM, Kevin Moulart <kevinmoulart@gmail.com> wrote:
>
> > Thanks, that's about the clearest answer I got so far :)
> >
> >
> > 2014-02-24 11:59 GMT+01:00 Sebastian Schelter <ssc@apache.org>:
> >
> > > > NaiveBayes expects a SequenceFile as input. The key is the class label
> > > > as Text; the value is the features as a VectorWritable.
> > >
> > > --sebastian
> > >
> > >
> > > On 02/24/2014 11:51 AM, Kevin Moulart wrote:
> > >
> > > >> Hi again,
> > > >> I finally settled on going through Java to make a sequence file for
> > > >> Naive Bayes, but I still can't find anywhere that states exactly what
> > > >> should be in the sequence file for Mahout to process it with Naive Bayes.
> > > >>
> > > >> I tried virtually every piece of code I found related to this subject,
> > > >> with no luck.
> > > >>
> > > >> My CSV file is like this:
> > > >> Label that I want to predict, feature 1, feature 2, ..., feature 1628
> > > >>
> > > >> Could someone tell me exactly what the Naive Bayes training procedure
> > > >> expects?
> > >>
> > >>
> > >> 2014-02-20 13:56 GMT+01:00 Jay Vyas <jayunit100@gmail.com>:
> > >>
> > > >>> This relates to a previous question I have: does Mahout have a concept
> > > >>> of adapters that let us read CSV-style data with filters to create the
> > > >>> exact format for its various inputs (e.g. the Recommender three-column
> > > >>> format)? If not, is it worth a JIRA?
> > > >>>
> > > >>>
> > > >>> On Feb 20, 2014, at 7:50 AM, Kevin Moulart <kevinmoulart@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>>> Hi and thanks!
> > > >>>>
> > > >>>> What about the command line? Is there a way to do that using the
> > > >>>> existing command line?
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> 2014-02-20 12:02 GMT+01:00 Suneel Marthi <suneel_marthi@yahoo.com>:
> > >>>>
> > > >>>>> To convert input CSV to vectors, you can either:
> > > >>>>>
> > > >>>>> a) Use CSVIterator
> > > >>>>> b) Use InputDriver
> > > >>>>>
> > > >>>>> Either of the above should generate vectors from the input CSV, which
> > > >>>>> could then be fed into Mahout classifier/clustering jobs.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <
> > >>>>> kevinmoulart@gmail.com> wrote:
> > >>>>>
> > > >>>>> Hi, I'm trying to apply a Naive Bayes classifier to a large CSV file
> > > >>>>> from the command line.
> > > >>>>>
> > > >>>>> I know I have to feed the classifier a seq file, so I tried to put my
> > > >>>>> CSV into one using the seqdirectory command, but even when I try with a
> > > >>>>> really small CSV (less than 100 MB) I instantly get an OutOfMemoryError
> > > >>>>> (Java heap space):
> > > >>>>>
> > > >>>>> mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
> > > >>>>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> > > >>>>>> Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
> > > >>>>>> MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
> > > >>>>>> 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments:
> > > >>>>>> {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647],
> > > >>>>>> --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
> > > >>>>>> --input=[/user/cacf/Echant/testSeq], --keyPrefix=[],
> > > >>>>>> --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0],
> > > >>>>>> --tempDir=[temp]}
> > > >>>>>> 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
> > > >>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> > > >>>>>> at java.util.Arrays.copyOf(Arrays.java:2367)
> > > >>>>>> at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
> > > >>>>>> at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
> > > >>>>>> at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
> > > >>>>>> at java.lang.StringBuilder.append(StringBuilder.java:132)
> > > >>>>>> at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
> > > >>>>>> at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
> > > >>>>>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
> > > >>>>>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
> > > >>>>>> at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
> > > >>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > > >>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> > > >>>>>> at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
> > > >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > >>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
> > > >>>>>> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> > > >>>>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> > > >>>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
> > > >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > >>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
> > > >>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
> > > >>>>>
> > > >>>>> Do you have an idea or a simple way to use Naive Bayes against my
> > > >>>>> large CSV?
> > >>>>>
> > >>>>> Thanks in advance !
> > >>>>> --
> > >>>>> Kévin Moulart
> > >>>>> GSM France : +33 7 81 06 10 10
> > >>>>> GSM Belgique : +32 473 85 23 85
> > >>>>> Téléphone fixe : +32 2 771 88 45
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Kévin Moulart
> > >>>> GSM France : +33 7 81 06 10 10
> > >>>> GSM Belgique : +32 473 85 23 85
> > >>>> Téléphone fixe : +32 2 771 88 45
> > >>>>
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >
> >
> >
> > --
> > Kévin Moulart
> > GSM France : +33 7 81 06 10 10
> > GSM Belgique : +32 473 85 23 85
> > Téléphone fixe : +32 2 771 88 45
> >
>



-- 
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45
