Ted,
I do have lots of data (Twitter and Facebook messages); however, I do not
yet have much training data. The problem is that the data I am interested
in is very sparse (say 1 in 10,000, as a guess), hence the reason a binary
classifier seemed logical. One question I would have is: what is a rule of
thumb for the necessary size of a training set? The 20 Newsgroups examples
in the Mahout examples and Mahout in Action (MIA) use approximately 10,000
documents.
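To make the scale of the problem concrete, here is the back-of-the-envelope arithmetic (the 10,000-example target and the prevalence figures are only the guesses from this thread, not measured values):

```java
public class TrainingSetSizing
{
    // Raw messages needed to expect `targetPositives` interesting
    // messages when roughly one message in `oneIn` is interesting.
    static long rawMessagesNeeded( long targetPositives, long oneIn )
    {
        return targetPositives * oneIn;
    }

    public static void main( String[] args )
    {
        // Unfiltered stream: ~1 interesting message in 10,000.
        System.out.println( rawMessagesNeeded( 10000, 10000 ) ); // 100000000
        // If the stream could be pre-filtered to ~1 in 50.
        System.out.println( rawMessagesNeeded( 10000, 50 ) );    // 500000
    }
}
```

In other words, at the 1-in-10,000 rate a 10,000-positive training set implies on the order of 100 million raw messages, which is why raising the base rate first matters so much.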
In your options (a) through (c), I am not sure I understand the difference
between (a) and (c). Is (a) the current state (let it fail), and (c) a
small fix to let things complete, with the understanding that the result
is probably not valuable? My preference would be (c), given the ability to
view the results and understand the effectiveness of the model.
Assuming that I could run a preprocessing step on the messages, similar to
spam exclusion, to get to 1 in 50 or 1 in 20 potentially interesting
message content, I could develop a larger training dataset. With only 1 in
10,000 messages of interest, I don't think I can get to a 10,000-example
training set. Any recommendations on how to do this? I am looking at
Chapter 12 of MIA on clustering Twitter messages as a possible way of
implementing unsupervised clustering. I would then need to take this
output and discard those clusters (and their messages) which are not of
interest.
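A minimal sketch of that last filtering step, keeping only clusters that contain at least one known-interesting message, independent of how the clusters themselves are produced (the cluster assignments and the seed set of known-interesting message ids below are hypothetical; in practice the assignments would come from something like the k-means run in MIA Chapter 12):

```java
import java.util.*;

public class ClusterFilter
{
    // Keep only clusters that contain at least one known-interesting
    // message id; every other cluster (and its messages) is discarded.
    static Map<Integer, List<String>> keepInterestingClusters(
        Map<Integer, List<String>> clusters, Set<String> interestingMsgIds )
    {
        Map<Integer, List<String>> kept = new HashMap<Integer, List<String>>();
        for ( Map.Entry<Integer, List<String>> e : clusters.entrySet() )
        {
            for ( String msgId : e.getValue() )
            {
                if ( interestingMsgIds.contains( msgId ) )
                {
                    kept.put( e.getKey(), e.getValue() );
                    break;
                }
            }
        }
        return kept;
    }

    public static void main( String[] args )
    {
        Map<Integer, List<String>> clusters = new HashMap<Integer, List<String>>();
        clusters.put( 0, Arrays.asList( "m1", "m2" ) ); // contains a seed: kept
        clusters.put( 1, Arrays.asList( "m3", "m4" ) ); // no seed: discarded
        Set<String> seeds = new HashSet<String>( Arrays.asList( "m2" ) );
        System.out.println( keepInterestingClusters( clusters, seeds ).keySet() ); // [0]
    }
}
```

The surviving clusters would then feed the classifier's training set, which is one way to get from 1 in 10,000 toward the 1-in-50 range.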
BTW, thank you for promptly responding on a Sunday.
Tim
-------- Original Message --------
Subject: Re: AdaptiveLogisticRegression.close()
ArrayIndexOutOfBoundsException
From: Ted Dunning <ted.dunning@gmail.com>
Date: Sun, May 01, 2011 11:01 pm
To: user@mahout.apache.org
Tim,
What is your preference?
Do you actually have lots of data?
On Sun, May 1, 2011 at 2:00 PM, Ted Dunning <ted.dunning@gmail.com>
wrote:
> OK. I think that this is a bug in Mahout. The problem is very likely that
> you are putting in very little training data. The
> AdaptiveLogisticRegression batches training examples up so that threading
> runs efficiently. Unfortunately, this means that nothing happens right
> away, and it appears that when you close the model, nothing has been done
> and the examples aren't flushed correctly.
>
> The workaround is to use lots of data.
>
> The fix is for me to noodle on what it should mean when we only see a tiny
> amount of data. The entire point of online learning gets a little hazy at
> these levels. I can see several options:
>
> a) just bitch and say that we don't support tiny problems (bad marketing at
> the least)
>
> b) use the buffered training examples and run several iterations while
> recycling the data in randomized order. This makes restarting learning kind
> of strange.
>
> c) just do the requested single pass of learning and let the user figure
> things out.
>
> My inclination is toward (c), but there are lots of implications that make
> this kind of tricky.
>
>
> On Sun, May 1, 2011 at 12:58 PM, Tim Snyder <tim@proinnovations.com> wrote:
>
>> In the interests of timeliness (I would have to figure out how to use
>> github to apache), and the fact that the code is not that long - I will
>> just post it here. It is pretty much the code from the examples and
>> Mahout in Action with a few modifications. Beware of nonstandard
>> formatting, though.
>>
>> As to the number of items I have been running through the trainer -
>> about 200. I am just trying to get my first trainer, evaluator,
>> production end-to-end flow going before I start loading it up.
>>
>>
>> package com.zensa.spinn3r.mahout;
>>
>> import com.google.common.collect.Maps;
>> import com.zensa.config.ConfigProperties;
>>
>> import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
>> import org.apache.mahout.classifier.sgd.CrossFoldLearner;
>> import org.apache.mahout.classifier.sgd.L1;
>> import org.apache.mahout.classifier.sgd.ModelDissector;
>> import org.apache.mahout.classifier.sgd.ModelSerializer;
>> import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
>> import org.apache.mahout.ep.State;
>> import org.apache.mahout.math.Matrix;
>> import org.apache.mahout.math.RandomAccessSparseVector;
>> import org.apache.mahout.math.Vector;
>> import org.apache.mahout.math.function.DoubleFunction;
>> import org.apache.mahout.math.function.Functions;
>>
>>
>> import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
>> import org.apache.mahout.vectorizer.encoders.Dictionary;
>> import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
>> import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
>>
>> import java.io.IOException;
>> import java.util.List;
>> import java.util.Map;
>> import java.util.Set;
>>
>>
>> public class SnomedSGDClassificationTrainer
>> {
>>   private static final int FEATURES = 1000;
>>
>>   private AdaptiveLogisticRegression learningAlgorithm = null;
>>
>>   private static final FeatureVectorEncoder encoder =
>>       new StaticWordValueEncoder( "code" );
>>   private static final FeatureVectorEncoder bias =
>>       new ConstantValueEncoder( "Intercept" );
>>
>>   private Dictionary msgs = new Dictionary();
>>
>>   private double averageLL = 0;
>>   private double averageCorrect = 0;
>>
>>   private int k = 0;
>>   private double step = 0;
>>   private int[] bumps = { 1, 2, 5 };
>>
>>   private String modelFile = null;
>>
>>
>>   public void initialize()
>>   {
>>     encoder.setProbes( 2 );
>>
>>     learningAlgorithm =
>>         new AdaptiveLogisticRegression( 2, FEATURES, new L1() );
>>     learningAlgorithm.setInterval( 800 );
>>     learningAlgorithm.setAveragingWindow( 500 );
>>
>>     modelFile = ConfigProperties.getInstance().getProperty(
>>         "spinn3r.classifier.model_file" );
>>   }
>>
>>
>>   public void train( String msgId, Map<String, Integer> codes )
>>       throws IOException
>>   {
>>     int actual = msgs.intern( msgId );
>>
>>     Vector v = encodeFeatureVector( codes, actual );
>>     learningAlgorithm.train( actual, v );
>>
>>     k++;
>>
>>     int bump = bumps[(int) Math.floor( step ) % bumps.length];
>>     int scale = (int) Math.pow( 10, Math.floor( step / bumps.length ) );
>>
>>     State<AdaptiveLogisticRegression.Wrapper, CrossFoldLearner> best =
>>         learningAlgorithm.getBest();
>>
>>     double maxBeta;
>>     double norm;
>>     double nonZeros;
>>     double positive;
>>     double lambda = 0;
>>     double mu = 0;
>>
>>     if( best != null )
>>     {
>>       CrossFoldLearner state = best.getPayload().getLearner();
>>
>>       averageCorrect = state.percentCorrect();
>>       averageLL = state.logLikelihood();
>>
>>       OnlineLogisticRegression model = state.getModels().get( 0 );
>>       model.close();
>>
>>       Matrix beta = model.getBeta();
>>       maxBeta = beta.aggregate( Functions.MAX, Functions.ABS );
>>       norm = beta.aggregate( Functions.PLUS, Functions.ABS );
>>       // count coefficients that are effectively non-zero
>>       nonZeros = beta.aggregate( Functions.PLUS, new DoubleFunction()
>>       {
>>         @Override
>>         public double apply( double v )
>>         {
>>           return Math.abs( v ) > 1.0e-6 ? 1 : 0;
>>         }
>>       } );
>>       positive = beta.aggregate( Functions.PLUS, new DoubleFunction()
>>       {
>>         @Override
>>         public double apply( double v )
>>         {
>>           return v > 0 ? 1 : 0;
>>         }
>>       } );
>>
>>       lambda = best.getMappedParams()[0];
>>       mu = best.getMappedParams()[1];
>>     }
>>     else
>>     {
>>       maxBeta = 0;
>>       nonZeros = 0;
>>       positive = 0;
>>       norm = 0;
>>     }
>>
>>     if( k % ( bump * scale ) == 0 )
>>     {
>>       if( best != null )
>>       {
>>         ModelSerializer.writeBinary(
>>             "/Users/tim/spinn3r_data/model/snomed" + k + ".model",
>>             best.getPayload().getLearner() );
>>       }
>>
>>       step += 0.25;
>>       System.out.printf( "%.2f\t%.2f\t%.2f\t%.2f\t%.8g\t%.8g\t",
>>           maxBeta, nonZeros, positive, norm, lambda, mu );
>>       System.out.printf( "%d\t%.3f\t%.2f\n", k, averageLL,
>>           averageCorrect * 100 );
>>     }
>>   }
>>
>>
>>   public void finishTraining( String msgId, Map<String, Integer> codes )
>>       throws IOException
>>   {
>>     try
>>     {
>>       learningAlgorithm.close();
>>     }
>>     catch( Exception e )
>>     {
>>       System.out.println( "SnomedSGDClassificationTrainer.finishTraining() -"
>>           + " learningAlgorithm.close() Error = " + e.getMessage() );
>>       e.printStackTrace();
>>     }
>>
>>     dissect( msgId, learningAlgorithm, codes );
>>
>>     ModelSerializer.writeBinary( modelFile,
>>         learningAlgorithm.getBest().getPayload().getLearner() );
>>   }
>>
>>
>>   private Vector encodeFeatureVector( Map<String, Integer> codes, int actual )
>>   {
>>     Vector v = new RandomAccessSparseVector( FEATURES );
>>
>>     bias.addToVector( "", 1, v );
>>
>>     for( String code : codes.keySet() )
>>     {
>>       Integer count = codes.get( code );
>>       encoder.addToVector( code, Math.log( 1 + count.intValue() ), v );
>>     }
>>
>>     return v;
>>   }
>>
>>
>>   private void dissect( String msgId,
>>       AdaptiveLogisticRegression learningAlgorithm,
>>       Map<String, Integer> codes )
>>   {
>>     CrossFoldLearner model =
>>         learningAlgorithm.getBest().getPayload().getLearner();
>>     model.close();
>>
>>     Map<String, Set<Integer>> traceDictionary = Maps.newTreeMap();
>>     ModelDissector md = new ModelDissector();
>>
>>     encoder.setTraceDictionary( traceDictionary );
>>     bias.setTraceDictionary( traceDictionary );
>>
>>     int actual = msgs.intern( msgId );
>>
>>     traceDictionary.clear();
>>     Vector v = encodeFeatureVector( codes, actual );
>>     md.update( v, traceDictionary, model );
>>
>>     List<ModelDissector.Weight> weights = md.summary( 1000 );
>>     for( ModelDissector.Weight w : weights )
>>     {
>>       System.out.printf( "%s\t%.1f\t%s\t%.1f\t%s\t%.1f\n",
>>           w.getFeature(), w.getWeight(), w.getCategory( 1 ),
>>           w.getWeight( 1 ), w.getCategory( 2 ), w.getWeight( 2 ) );
>>     }
>>   }
>> }
>>
>> The End.
>>
>> Tim
>>
>>
>> -------- Original Message --------
>> Subject: Re: AdaptiveLogisticRegression.close()
>> ArrayIndexOutOfBoundsException
>> From: Ted Dunning <ted.dunning@gmail.com>
>> Date: Sun, May 01, 2011 7:37 pm
>> To: user@mahout.apache.org
>>
>> Can you put your code on github?
>>
>> There is a detail that slipped somewhere and I can't guess where it is.
>> Your constructor is correct for a binary classifier, but I can't say
>> much else.
>>
>> How much data, btw, did you pour in?
>>
>> On Sun, May 1, 2011 at 3:18 AM, Tim Snyder <tim@proinnovations.com>
>> wrote:
>>
>> > I am currently using trunk from April 30, 2011. The code loosely
>> > follows the SGD training example from Mahout in Action. I have
>> > instantiated the learner with the purpose of having a binary
>> > classifier:
>> >
>> > AdaptiveLogisticRegression learningAlgorithm = new
>> > AdaptiveLogisticRegression( 2, FEATURES, new L1() );
>> >
>> > Everything works fine (i.e. the training) until I get to
>> > learningAlgorithm.close(), where I get the following exception:
>> >
>> > learningAlgorithm.close() Error =
>> > java.lang.ArrayIndexOutOfBoundsException
>> > java.lang.IllegalStateException:
>> > java.lang.ArrayIndexOutOfBoundsException
>> > Exception = null
>> > at org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression.trainWithBufferedExamples(AdaptiveLogisticRegression.java:144)
>> > at org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression.close(AdaptiveLogisticRegression.java:196)
>> > at com.zensa.spinn3r.mahout.SnomedSDGClassificationTrainer.finishTraining(SnomedSDGClassificationTrainer.java:159)
>> > at com.spinn3r.sdg.trainer.Spinn3rSDGTrainer.process(Spinn3rSDGTrainer.java:170)
>> > at com.spinn3r.sdg.trainer.Spinn3rSDGTrainer.main(Spinn3rSDGTrainer.java:272)
>> > Caused by: java.lang.ArrayIndexOutOfBoundsException
>> >
>> > If I change the number of categories to 100, the close() works fine. Any
>> > ideas on how to get around this and have a working binary classifier?
>> >
>> > Thanks in advance.
>> >
>> > Tim
>> >
>> >
>>
>>
>
