mahout-user mailing list archives

From: Andreas Bauer <b...@gmx.net>
Subject: Re: OnlineLogisticRegression: Are my settings sensible
Date: Fri, 08 Nov 2013 05:45:53 GMT
Hi, 

Thanks for your comments. 

I modified the examples from the Mahout in Action book; that's why I used the hashed
approach and why I used 100 features. I'll adjust the number.
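
If I understand your suggestion correctly, I'd set FEATURE_NUMBER = 13 and fill the
vector directly instead of hashing. Roughly like this? (Just my sketch of your idea;
the toArray() helper collecting getFeatureValue1()..getFeatureValue12() is made up.)

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

// Constant value at index 0, the 12 continuous features at indices 1..12.
Vector encode(TrainingSample sample) {
    Vector v = new DenseVector(13);
    v.setQuick(0, 1.0); // constant term instead of the hashed bias encoder
    double[] features = sample.toArray(); // hypothetical accessor for the 12 values
    for (int i = 0; i < 12; i++) {
        v.setQuick(i + 1, features[i]);
    }
    return v;
}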

You say that I'm using the same CVE for all features, so do you mean I should create
12 separate CVEs and add the features to the vector like this?


BIAS.addToVector((byte[]) null, 1, denseVector);

this.cve1.addToVector((byte[]) null,
        sample.getFeatureValue1(), denseVector);
...
this.cve12.addToVector((byte[]) null,
        sample.getFeatureValue12(), denseVector);
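
(I assume each encoder would then need its own name, so that each feature hashes to
its own locations. For example, constructed along these lines, with placeholder names,
where cves[0]..cves[11] would play the role of cve1..cve12 above:)

import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;

// One encoder per feature; distinct names give distinct hashed locations.
ContinuousValueEncoder[] cves = new ContinuousValueEncoder[12];
for (int i = 0; i < 12; i++) {
    cves[i] = new ContinuousValueEncoder("feature" + (i + 1)); // names made up
}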

The 12 vs. 15 in my original mail is only a typo; it should be getFeatureValue12.

Finally, I thought online logistic regression meant that it is an online algorithm,
so it would be fine to train on each sample only once. Should I instead invoke the
train method over and over again with the same training sample until the next one
arrives, or how else should I make the model converge (or at least try to, with the
few samples I have)?
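
For example, would it be the right idea to buffer the samples seen so far and re-run
train() over them in several passes whenever a new one arrives? Just to sketch what I
mean (NUM_PASSES and onNewSample() are made-up names):

import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.math.Vector;

private static final int NUM_PASSES = 10; // made-up number of passes
private final List<Vector> inputs = new ArrayList<Vector>();
private final List<Integer> targets = new ArrayList<Integer>();

public void onNewSample(Vector input, int target) {
    inputs.add(input);
    targets.add(target);
    // Several passes over everything seen so far, so the model
    // sees each example more than once.
    for (int pass = 0; pass < NUM_PASSES; pass++) {
        for (int i = 0; i < inputs.size(); i++) {
            olr.train(targets.get(i), inputs.get(i));
        }
    }
}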

What would you suggest to use for incremental training instead of OLR? Is Mahout
perhaps the wrong library?

Many thanks, 

Andreas 



Ted Dunning <ted.dunning@gmail.com> wrote:
>Why is FEATURE_NUMBER != 13?
>
>With 12 features that are already lovely and continuous, just stick them
>in elements 1..12 of a 13-long vector and put a constant value at the
>beginning of it.  Hashed encoding is good for sparse stuff, but confusing
>for your case.
>
>Also, it looks like you only pass through the (very small) training set
>once.  The OnlineLogisticRegression is unlikely to converge very well
>with such a small number of examples.
>
>Finally, in the hashed representation that you are using, you use
>exactly the same CVE to put all 15 (12?) of the variables into the
>vector.  Since you are using the same CVE, all of these values will be
>put into exactly the same location, which is going to kill performance
>since you will get the effect of summing all your variables together.
>
>
>On Thu, Nov 7, 2013 at 1:48 PM, Andreas Bauer <buki@gmx.net> wrote:
>
>> Hi,
>>
>> I’m trying to use OnlineLogisticRegression for a two-class
>> classification problem, but as my classification results are not very
>> good, I wanted to ask for support to find out if my settings are
>> correct and if I’m using Mahout correctly. Because if I’m doing it
>> correctly, then probably my features are crap...
>>
>> In total I have 12 features. All are continuous values and all are
>> normalized/standardized (this has no effect on the classification
>> performance at the moment).
>>
>> Training samples keep flowing in at a constant rate (i.e. incremental
>> training), but in total it won’t be more than a few thousand (class
>> split positive/negative 30:70).
>>
>> My performance measures do not really get good, e.g. with approx. 3600
>> training samples I get
>>
>> f-measure(beta=0.5): 0.38
>> precision: 0.33
>> recall: 0.47
>>
>> The parameters I use are
>>
>> lambda=0.0001
>> offset=1000
>> alpha=1
>> decay_exponent=0.9
>> learning_rate=50
>>
>>
>> FEATURE_NUMBER = 100;
>> CATEGORIES_NUMBER = 2;
>>
>>
>>
>> Java code snippet:
>>
>> private OnlineLogisticRegression olr;
>> private ContinuousValueEncoder continuousValueEncoder;
>>
>> private static final FeatureVectorEncoder BIAS =
>>         new ConstantValueEncoder("Intercept");
>>
>> …
>> public Training() {
>>     // L2 or ElasticBandPrior do not affect the performance
>>     olr = new OnlineLogisticRegression(CATEGORIES_NUMBER, FEATURE_NUMBER, new L1());
>>     olr.lambda(lambda)
>>        .learningRate(learning_rate)
>>        .stepOffset(offset)
>>        .decayExponent(decay_exponent);
>>     this.continuousValueEncoder = new ContinuousValueEncoder("ContinuousValueEncoder");
>>     this.continuousValueEncoder.setProbes(20);
>>     …
>> }
>>
>> public void train(TrainingSample sample, int target) {
>>     DenseVector denseVector = new DenseVector(FEATURE_NUMBER);
>>     // sample.getFeatureValue1()-15() each return a double value
>>     this.continuousValueEncoder.addToVector((byte[]) null,
>>             sample.getFeatureValue1(), denseVector);
>>     …
>>     this.continuousValueEncoder.addToVector((byte[]) null,
>>             sample.getFeatureValue15(), denseVector);
>>     BIAS.addToVector((byte[]) null, 1, denseVector);
>>     olr.train(target, denseVector);
>> }
>>
>> It is also interesting to note that when I use the model, both test
>> and classification always yield probabilities of 1.0 or 0.99xxx for
>> either class.
>>
>> result = this.olr.classifyFull(input);
>> LOGGER.debug("TrainingSink test: classify real category:"
>>         + realCategory + " olr classifier result: "
>>         + result.maxValueIndex() + " prob: " + result.maxValue());
>>
>> It would be great if you could give me some advice.
>>
>> Many thanks,
>>
>> Andreas
>>
