mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: sgd.TrainNewsGroups error
Date Thu, 09 Dec 2010 23:50:47 GMT
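The problem is that this is example code designed to work on the training
data set.  The test data set is smaller: it has only 7532 files, so
subList(0, 10000) asks for more elements than the list contains and throws
IndexOutOfBoundsException.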

To fix this, change the line in question:

   for (File file : files.subList(0, 10000)) {

to this:

   int samples = Math.min(files.size(), 10000);
   for (File file: files.subList(0, samples)) {

Or even remove the limit:

   for (File file : files) {

The first option processes the first 10,000 files, or all of them if there
are fewer, and the second option uses all of the data.
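
As a minimal, self-contained sketch of the first option (not the actual
TrainNewsGroups code): the class SubListLimit and the helper firstN are
hypothetical names used only for illustration, and a list of integers stands
in for the list of files. It clamps the upper bound with Math.min before
calling subList, which is what avoids the exception on a short list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class SubListLimit {
    // Returns a view of the first `limit` elements, or of the whole list
    // when it holds fewer than `limit` elements. Assumes limit >= 0.
    static <T> List<T> firstN(List<T> items, int limit) {
        int samples = Math.min(items.size(), limit);
        return items.subList(0, samples);
    }

    public static void main(String[] args) {
        // Stand-in for the 7532 test files reported in the thread.
        List<Integer> files = new ArrayList<>();
        for (int i = 0; i < 7532; i++) {
            files.add(i);
        }

        // The unclamped call fails because toIndex exceeds the list size.
        try {
            files.subList(0, 10000);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unclamped subList failed: " + e.getMessage());
        }

        System.out.println(firstN(files, 10000).size()); // prints 7532
        System.out.println(firstN(files, 100).size());   // prints 100
    }
}
```

Clamping keeps the option of an abbreviated run on large inputs while still
working when the input is smaller than the limit.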


The limit is there because I was running this program roughly a hundred
billion times while tuning the SGD implementation and writing chapters 13-16
of the MiA book, and I often needed to do an abbreviated training run.  I
should have removed it some time ago.


On Thu, Dec 9, 2010 at 1:59 PM, ivek gimmick <gimmickivek@gmail.com> wrote:

> I am trying to execute the above code as
>
> -distribution-0.4 $ bin/mahout
> org.apache.mahout.classifier.sgd.TrainNewsGroups
> examples/bin/work/20news-bydate/20news-bydate-test 2
>
> no HADOOP_HOME set, running locally
> Dec 9, 2010 4:53:29 PM org.slf4j.impl.JCLLoggerAdapter warn
> WARNING: No org.apache.mahout.classifier.sgd.TrainNewsGroups.props found on
> classpath, will use command-line arguments only
> *7532* training files
> Exception in thread "main" java.lang.IndexOutOfBoundsException: toIndex = *10000*
> at java.util.SubList.<init>(AbstractList.java:602)
> at java.util.RandomAccessSubList.<init>(AbstractList.java:758)
> at java.util.AbstractList.subList(AbstractList.java:468)
> at org.apache.mahout.classifier.sgd.TrainNewsGroups.main(TrainNewsGroups.java:159)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>
>
>
> The limit is 10000 > 7532; I am not sure why this gives IndexOutOfBounds.
>    for (File file : files.subList(0, 10000)) {  .... is line 159 of
> TrainNewsGroups.java
>
