mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: AdaptiveLogisticRegression.close() ArrayIndexOutOfBoundsException
Date Sun, 01 May 2011 21:01:26 GMT
Tim,

What is your preference?

Do you actually have lots of data?

On Sun, May 1, 2011 at 2:00 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> OK.  I think that this is a bug in Mahout.  The problem is very likely that
> you are putting in very little training data.  The
> AdaptiveLogisticRegression batches training examples up so that threading
> runs efficiently.  Unfortunately, this means that nothing happens right
> away, which appears to mean that when you close the model, nothing has been
> done and the examples aren't flushed correctly.
>
> The work-around is to use lots of data.
>
> The fix is for me to noodle on what it should mean when we only see a tiny
> amount of data.  The entire point of on-line learning gets a little hazy at
> these levels.  I can see several options:
>
> a) just bitch and say that we don't support tiny problems (bad marketing at
> the least)
>
> b) use the buffered training examples and run several iterations while
> recycling the data in randomized order.  This makes restarting learning kind
> of strange.
>
> c) just do the requested single pass of learning and let the user figure
> things out.
>
> My inclination is toward (c), but there are lots of implications that make
> this kind of tricky.
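Option (b) above, recycling the buffered examples in randomized order, can be sketched in plain Java. Everything here is hypothetical scaffolding, not Mahout API: `trainOne()` stands in for `learningAlgorithm.train( actual, v )`, and the threshold of 800 mirrors the `setInterval( 800 )` call in Tim's code below.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of option (b): with only ~200 training examples and a buffering
// interval of 800, recycle the data in randomized order until enough
// examples have been presented for the buffer to flush.
public class RecycleSketch {

    // Hypothetical stand-in for learningAlgorithm.train( actual, v );
    // here it only counts how many examples have been presented.
    static int presented = 0;
    static void trainOne(int example) { presented++; }

    // Keep making shuffled passes over a small data set until at least
    // `interval` examples have been presented. Returns the total count.
    static int presentUntil(int nExamples, int interval) {
        List<Integer> examples = new ArrayList<>();
        for (int i = 0; i < nExamples; i++) examples.add(i);

        Random rand = new Random(42);              // fixed seed for repeatability
        while (presented < interval) {
            Collections.shuffle(examples, rand);   // randomized order each pass
            for (int ex : examples) trainOne(ex);
        }
        return presented;
    }

    public static void main(String[] args) {
        // 200 examples, interval 800: four shuffled passes flush the buffer.
        System.out.println(presentUntil(200, 800));   // prints 800
    }
}
```

The restart-strangeness Ted mentions is visible even here: a second call would see the buffer already flushed and do nothing.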
>
>
> On Sun, May 1, 2011 at 12:58 PM, Tim Snyder <tim@proinnovations.com>wrote:
>
>> In the interests of timeliness (I would have to figure out how to get the
>> code onto github for Apache), and the fact that the code is not that long, I
>> will just post it here. It is pretty much the code from the examples and
>> Mahout in Action with a few modifications. Beware of non-standard
>> formatting though.
>>
>> As to the number of items I have been running through the trainer -
>> about 200. I am just trying to get my first trainer, evaluator,
>> production end to end going before I start loading it up.
>>
>>
>> package com.zensa.spinn3r.mahout;
>>
>> import com.google.common.collect.Maps;
>> import com.zensa.config.ConfigProperties;
>>
>> import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
>> import org.apache.mahout.classifier.sgd.CrossFoldLearner;
>> import org.apache.mahout.classifier.sgd.L1;
>> import org.apache.mahout.classifier.sgd.ModelDissector;
>> import org.apache.mahout.classifier.sgd.ModelSerializer;
>> import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
>> import org.apache.mahout.ep.State;
>> import org.apache.mahout.math.Matrix;
>> import org.apache.mahout.math.RandomAccessSparseVector;
>> import org.apache.mahout.math.Vector;
>> import org.apache.mahout.math.function.DoubleFunction;
>> import org.apache.mahout.math.function.Functions;
>>
>>
>> import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
>> import org.apache.mahout.vectorizer.encoders.Dictionary;
>> import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
>> import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
>>
>> import java.io.IOException;
>> import java.util.List;
>> import java.util.Map;
>> import java.util.Set;
>>
>>
>> public class SnomedSGDClassificationTrainer
>> {
>>  private static final int                  FEATURES       = 1000;
>>
>>  private AdaptiveLogisticRegression learningAlgorithm = null;
>>
>>  private static final FeatureVectorEncoder encoder = new StaticWordValueEncoder( "code" );
>>  private static final FeatureVectorEncoder bias    = new ConstantValueEncoder( "Intercept" );
>>
>>  private Dictionary msgs = new Dictionary();
>>
>>  private double averageLL = 0;
>>  private double averageCorrect = 0;
>>
>>  private int    k     = 0;
>>  private double step  = 0;
>>  private int[]  bumps = { 1, 2, 5 };
>>
>>  private String modelFile = null;
>>
>>
>>  public void initialize()
>>  {
>>    encoder.setProbes( 2 );
>>
>>    learningAlgorithm = new AdaptiveLogisticRegression( 2, FEATURES, new L1() );
>>    learningAlgorithm.setInterval( 800 );
>>    learningAlgorithm.setAveragingWindow( 500 );
>>
>>    modelFile = ConfigProperties.getInstance().getProperty( "spinn3r.classifier.model_file" );
>>  }
>>
>>
>>  public void train( String msgId, Map<String, Integer> codes )
>>    throws IOException
>>  {
>>    int actual = msgs.intern( msgId );
>>
>>    Vector v = encodeFeatureVector( codes, actual );
>>    learningAlgorithm.train( actual, v );
>>
>>    k++;
>>
>>    int bump = bumps[(int)Math.floor( step ) % bumps.length];
>>    int scale = (int)Math.pow( 10, Math.floor( step / bumps.length ) );
>>
>>    State<AdaptiveLogisticRegression.Wrapper, CrossFoldLearner> best = learningAlgorithm.getBest();
>>
>>    double maxBeta;
>>    double norm;
>>    double nonZeros;
>>    double positive;
>>    double lambda = 0;
>>    double mu = 0;
>>
>>    if( best != null )
>>    {
>>      CrossFoldLearner state = best.getPayload().getLearner();
>>
>>      averageCorrect = state.percentCorrect();
>>      averageLL      = state.logLikelihood();
>>
>>      OnlineLogisticRegression model = state.getModels().get( 0 );
>>      model.close();
>>
>>      Matrix beta = model.getBeta();
>>      maxBeta  = beta.aggregate( Functions.MAX, Functions.ABS );
>>      norm     = beta.aggregate( Functions.PLUS, Functions.ABS );
>>      nonZeros = beta.aggregate( Functions.PLUS, new DoubleFunction()
>>                                                 {
>>                                                   @Override
>>                                                   public double apply( double v )
>>                                                   {
>>                                                     return Math.abs( v ) > 1.0e-6 ? 1 : 0;
>>                                                   }
>>                                                 });
>>      positive = beta.aggregate( Functions.PLUS, new DoubleFunction()
>>                                                 {
>>                                                   @Override
>>                                                   public double apply( double v )
>>                                                   {
>>                                                     return v > 0 ? 1 : 0;
>>                                                   }
>>                                                 });
>>
>>      lambda = learningAlgorithm.getBest().getMappedParams()[0];
>>      mu     = learningAlgorithm.getBest().getMappedParams()[1];
>>    }
>>    else
>>    {
>>      maxBeta  = 0;
>>      nonZeros = 0;
>>      positive = 0;
>>      norm     = 0;
>>    }
>>
>>    if( k % ( bump * scale ) == 0 )
>>    {
>>      if( learningAlgorithm.getBest() != null )
>>      {
>>
>>        ModelSerializer.writeBinary( "/Users/tim/spinn3r_data/model/snomed-" + k + ".model",
>>                                     learningAlgorithm.getBest().getPayload().getLearner() );
>>      }
>>
>>      step += 0.25;
>>      System.out.printf( "%.2f\t%.2f\t%.2f\t%.2f\t%.8g\t%.8g\t",
>>                         maxBeta, nonZeros, positive, norm, lambda, mu );
>>      System.out.printf( "%d\t%.3f\t%.2f\n", k, averageLL, averageCorrect * 100 );
>>    }
>>  }
>>
>>
>>  public void finishTraining( String msgId, Map<String, Integer> codes )
>>    throws IOException
>>  {
>>    try
>>    {
>>      learningAlgorithm.close();
>>    }
>>    catch( Exception e )
>>    {
>>      System.out.println( "SnomedClassificationTrainer.train() - learningAlgorithm.close() Error = " + e.getMessage() );
>>      e.printStackTrace();
>>    }
>>
>>    dissect( msgId, learningAlgorithm, codes );
>>
>>    ModelSerializer.writeBinary( modelFile, learningAlgorithm.getBest().getPayload().getLearner() );
>>  }
>>
>>
>>  private Vector encodeFeatureVector( Map<String, Integer> codes, int actual )
>>  {
>>    Vector v = new RandomAccessSparseVector( FEATURES );
>>
>>    bias.addToVector( "", 1, v );
>>
>>    for( String code : codes.keySet() )
>>    {
>>      Integer count = codes.get( code );
>>      encoder.addToVector( code, Math.log( 1 + count.intValue() ), v );
>>    }
>>
>>    return v;
>>  }
>>
>>
>>  private void dissect( String msgId, AdaptiveLogisticRegression learningAlgorithm, Map<String, Integer> codes )
>>  {
>>    CrossFoldLearner model = learningAlgorithm.getBest().getPayload().getLearner();
>>    model.close();
>>
>>    Map<String, Set<Integer>> traceDictionary = Maps.newTreeMap();
>>    ModelDissector md = new ModelDissector();
>>
>>    encoder.setTraceDictionary( traceDictionary );
>>    bias.setTraceDictionary( traceDictionary );
>>
>>    int actual = msgs.intern( msgId );
>>
>>    traceDictionary.clear();
>>    Vector v = encodeFeatureVector( codes, actual );
>>    md.update( v, traceDictionary, model );
>>
>>    List<ModelDissector.Weight> weights = md.summary( 1000 );
>>    for( ModelDissector.Weight w : weights )
>>    {
>>      System.out.printf( "%s\t%.1f\t%s\t%.1f\t%s\t%.1f\n",
>>                         w.getFeature(), w.getWeight(), w.getCategory( 1 ), w.getWeight( 1 ),
>>                         w.getCategory( 2 ), w.getWeight( 2 ) );
>>    }
>>  }
>>
>> }
>>
>> The End.
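As an aside, the `k % ( bump * scale )` check in train() above is a 1-2-5 logarithmic checkpointing pattern. Here is a standalone, purely illustrative sketch of that progression (with `step` advanced by a whole unit per checkpoint so the decades are easy to see; the posted code advances it by 0.25, so each threshold there repeats four times before moving on):

```java
import java.util.Arrays;

// Standalone sketch of the 1-2-5 checkpoint progression behind the
// k % ( bump * scale ) test above: thresholds walk the decades
// 1, 2, 5, 10, 20, 50, 100, ...
public class CheckpointSketch {
    static final int[] BUMPS = { 1, 2, 5 };

    // Returns the first n checkpoint thresholds.
    static int[] checkpoints(int n) {
        int[] out = new int[n];
        double step = 0;
        for (int i = 0; i < n; i++) {
            int bump  = BUMPS[(int) Math.floor(step) % BUMPS.length];
            int scale = (int) Math.pow(10, Math.floor(step / BUMPS.length));
            out[i] = bump * scale;
            step += 1;   // one whole step per checkpoint, for clarity
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(checkpoints(8)));
        // [1, 2, 5, 10, 20, 50, 100, 200]
    }
}
```

This keeps the model-snapshot output dense early in training and increasingly sparse as k grows.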
>>
>> Tim
>>
>>
>> -------- Original Message --------
>> Subject: Re: AdaptiveLogisticRegression.close()
>> ArrayIndexOutOfBoundsException
>> From: Ted Dunning <ted.dunning@gmail.com>
>> Date: Sun, May 01, 2011 7:37 pm
>> To: user@mahout.apache.org
>>
>> Can you put your code on github?
>>
>> There is a detail that slipped somewhere and I can't guess where it is.
>> Your constructor is correct for a binary classifier, but I can't say much else.
>>
>> How much data, btw, did you pour in?
>>
>> On Sun, May 1, 2011 at 3:18 AM, Tim Snyder <tim@proinnovations.com>
>> wrote:
>>
>> > I am currently using trunk from April 30, 2011. The code is loosely
>> > following the SGD training example from Mahout in Action. I have
>> > instantiated the learner with the purpose of having a binary classifier:
>> >
>> > AdaptiveLogisticRegression learningAlgorithm = new
>> > AdaptiveLogisticRegression( 2, FEATURES, new L1() );
>> >
>> > Everything works fine (i.e. the training) until I get to
>> > learningAlgorithm.close() where I get the following exception:
>> >
>> > learningAlgorithm.close() Error = java.lang.ArrayIndexOutOfBoundsException
>> > java.lang.IllegalStateException: java.lang.ArrayIndexOutOfBoundsException
>> > Exception = null
>> > at org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression.trainWithBufferedExamples(AdaptiveLogisticRegression.java:144)
>> > at org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression.close(AdaptiveLogisticRegression.java:196)
>> > at com.zensa.spinn3r.mahout.SnomedSDGClassificationTrainer.finishTraining(SnomedSDGClassificationTrainer.java:159)
>> > at com.spinn3r.sdg.trainer.Spinn3rSDGTrainer.process(Spinn3rSDGTrainer.java:170)
>> > at com.spinn3r.sdg.trainer.Spinn3rSDGTrainer.main(Spinn3rSDGTrainer.java:272)
>> > Caused by: java.lang.ArrayIndexOutOfBoundsException
>> >
>> > If I change the number of categories to 100, the close() works fine. Any
>> > ideas on how to get around this and have a working binary classifier?
>> >
>> > Thanks in advance.
>> >
>> > Tim
>> >
>> >
>>
>>
>
