mahout-user mailing list archives

From ivek gimmick <gimmicki...@gmail.com>
Subject Re: sgd.TrainNewsGroups error
Date Fri, 10 Dec 2010 02:14:01 GMT
Thanks Ted, really appreciate your help.

Also, the same kind of limit appears at line 259 of TrainNewsGroups.java, so I commented it out there as well:

259 //    for (File file : permute(files, rand).subList(0, 500)) {
260     for (File file : permute(files, rand)) {


On Thu, Dec 9, 2010 at 6:50 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> The problem is that this is example code designed to work on the training
> data set.  The test data set is smaller.
>
> To fix this, change the line in question:
>
>   for (File file : files.subList(0, 10000)) {
>
> to this:
>
>   int samples = Math.min(files.size(), 10000);
>   for (File file : files.subList(0, samples)) {
>
> Or even remove the limit:
>
>   for (File file : files) {
>
> The first option processes the first 10,000 files, or all of them if there
> are fewer, and the second option uses all of the data.
>
>
> The limit is in there because I was running this program roughly a hundred
> billion times while tuning the SGD implementation and writing chapters 13-16
> of the MiA book, and I often needed to be able to do an abbreviated training
> run.  I should have removed it some time ago.
>
>
> On Thu, Dec 9, 2010 at 1:59 PM, ivek gimmick <gimmickivek@gmail.com>
> wrote:
>
> > I am trying to execute the above code as
> >
> > -distribution-0.4 $ bin/mahout
> > org.apache.mahout.classifier.sgd.TrainNewsGroups
> > examples/bin/work/20news-bydate/20news-bydate-test 2
> >
> > no HADOOP_HOME set, running locally
> > Dec 9, 2010 4:53:29 PM org.slf4j.impl.JCLLoggerAdapter warn
> > WARNING: No org.apache.mahout.classifier.sgd.TrainNewsGroups.props found on
> > classpath, will use command-line arguments only
> > *7532* training files
> > Exception in thread "main" java.lang.IndexOutOfBoundsException: toIndex = *10000*
> > at java.util.SubList.<init>(AbstractList.java:602)
> > at java.util.RandomAccessSubList.<init>(AbstractList.java:758)
> > at java.util.AbstractList.subList(AbstractList.java:468)
> > at org.apache.mahout.classifier.sgd.TrainNewsGroups.main(TrainNewsGroups.java:159)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > at java.lang.reflect.Method.invoke(Method.java:597)
> > at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> >
> >
> >
> > The limit is 10000 > 7532; I am not sure why this gives an
> > IndexOutOfBoundsException.
> >    for (File file : files.subList(0, 10000)) {  .... is line 159 of
> > TrainNewsGroups.java
> >
>
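
For anyone hitting the same exception: subList(0, n) throws when n is larger than the list's size, so the limit has to be clamped with Math.min before slicing. Below is a minimal standalone sketch of that idea. The class SubListClamp and the helper firstAtMost are made-up names for illustration and are not part of Mahout; the list of integers only stands in for the 7532 test files, and the helper assumes a non-negative limit.

```java
import java.util.ArrayList;
import java.util.List;

public class SubListClamp {
    // Hypothetical helper (not Mahout code): returns a view of the first
    // `limit` items, or of the whole list when it holds fewer than `limit`.
    // Assumes limit >= 0.
    static <T> List<T> firstAtMost(List<T> items, int limit) {
        int end = Math.min(items.size(), limit);
        return items.subList(0, end);
    }

    public static void main(String[] args) {
        // Stand-in for the 7532 files in the 20news-bydate-test directory.
        List<Integer> files = new ArrayList<>();
        for (int i = 0; i < 7532; i++) {
            files.add(i);
        }

        // Clamped: a limit above the list size falls back to the full list.
        System.out.println(firstAtMost(files, 10000).size()); // prints 7532
        System.out.println(firstAtMost(files, 500).size());   // prints 500

        // Unclamped: an end index past the list size is rejected.
        try {
            files.subList(0, 10000);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unclamped subList failed: " + e.getMessage());
        }
    }
}
```

The same clamp would apply to the 500-file limit at line 259 if an abbreviated run is still wanted there.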
