mahout-user mailing list archives

From Benson Margulies <bimargul...@gmail.com>
Subject Re: Logistic Regression Tutorial
Date Fri, 29 Apr 2011 21:20:16 GMT
The following had no effect.

        // AdaptiveLogisticRegression knobs set small, per Ted's suggestion below:
        AdaptiveLogisticRegression model =
                new AdaptiveLogisticRegression(topicNumbers.size(), FEATURES, new L1());
        model.setInterval(200, 200);      // min/max interval between evolutionary steps
        model.setAveragingWindow(10);
        model.setPoolSize(10);
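
For context, the model is driven with the usual train/close calls, roughly like
the sketch below (the loop and variable names are placeholders, not my actual
harness):

        // Hypothetical driver loop; 'examples' and its accessors are placeholders.
        for (TrainingExample ex : examples) {
          model.train(ex.getTarget(), ex.getVector());   // target index + feature Vector
        }
        model.close();   // finish any pending training work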

On Fri, Apr 29, 2011 at 1:58 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Hmm... this looks very wrong.  AUC is 0.5 here which indicates that it has
> no data.
>
> There are a few options on AdaptiveLogisticRegression to set the averaging
> window and multi-threading batch size.  These probably should be set very
> small for your example (which is far smaller than I envisioned for this
> code).
>
> Alternatively, I can set up a non-adaptive trainer that does the EP outside
> of the learning.  This is much slower and much less scalable, but that
> hardly matters for a toy-sized problem.
>
> Let me know if you need that.
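
For what it's worth, my reading of the non-adaptive route is to use
OnlineLogisticRegression directly and pick the hyperparameters by hand; a rough
sketch (the learningRate and lambda values below are placeholders, not tuned):

        // Sketch only: org.apache.mahout.classifier.sgd.OnlineLogisticRegression,
        // with placeholder hyperparameters instead of the evolutionary search.
        OnlineLogisticRegression learner =
                new OnlineLogisticRegression(topicNumbers.size(), FEATURES, new L1())
                        .learningRate(50)    // placeholder
                        .lambda(1.0e-4);     // placeholder regularization
        for (int pass = 0; pass < 10; pass++) {
          for (TrainingExample ex : examples) {        // placeholder data holder
            learner.train(ex.getTarget(), ex.getVector());
          }
        }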
>
> On Fri, Apr 29, 2011 at 10:40 AM, Benson Margulies <bimargulies@gmail.com> wrote:
>
>> If I read this right, the AUC is constant:
>>
>>         1       1000 0.50 95.6
>>         2       1000 0.50 93.9
>>         3       1000 0.50 92.4
>>         4       1000 0.50 91.1
>>         5       1000 0.50 87.3
>>         6       1000 0.50 86.4
>>         7       1000 0.50 85.5
>>         8       1000 0.50 84.6
>>         9       1000 0.50 83.8
>>                          0.50 83.1 (final)
>>
>> Where do I go from here? Just run one iteration? Wait for more data?
>>
>> On Fri, Apr 29, 2011 at 12:39 PM, Ted Dunning <ted.dunning@gmail.com>
>> wrote:
>> > Yeah... I saw this in weaker form with RCV1.  It bugs the hell out of me,
>> > but I haven't had time to drill in on it.
>> >
>> > With RCV1, however, the AUC stayed constant and high.  AUC is what the
>> > evolutionary algorithm is fighting for while percent correct is only for
>> > a single threshold (0.5 for the binary case).  With asymmetric class
>> > rates, that threshold might be sub-optimal.  AUC doesn't use a threshold
>> > so that won't be an issue with it.  It is pretty easy to make the evo
>> > algorithm use percent-correct instead of AUC.
>> >
>> > Regarding the over-fitting, these accuracies are on-line estimates being
>> > reported on held-out data so it should be a reasonable estimate of error.
>> > With a time-based train/test split, test performance will probably be a
>> > bit lower than the estimate.
>> >
>> > The held-out data is formed by doing cross validation on the fly.  Each
>> > CrossFoldLearner inside the evolutionary algorithm maintains 5 online
>> > learning algorithms each of which gets a different split of training and
>> > test data.  This means that we get an out-of-sample estimate of
>> > performance every time we add a training sample.
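
If I'm reading this right, those are the numbers exposed on the best
CrossFoldLearner, roughly like the sketch below (assuming the 'model' variable
from my snippet above; getBest() can return null early on):

        // Sketch: reading the on-line held-out estimates from the best learner.
        State<AdaptiveLogisticRegression.Wrapper, CrossFoldLearner> best = model.getBest();
        if (best != null) {
          CrossFoldLearner learner = best.getPayload().getLearner();
          System.out.printf("AUC=%.2f  %%correct=%.1f%n",
              learner.auc(), 100 * learner.percentCorrect());
        }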
>> >
>> > On Fri, Apr 29, 2011 at 4:36 AM, Benson Margulies
>> > <bimargulies@gmail.com> wrote:
>> >
>> >> After the first pass, the model hasn't trained yet. After the second,
>> >> accuracy is 95.6%, and then it drifts gracefully downward with each
>> >> additional iteration, landing at 83%.
>> >>
>> >> I'm puzzled; I'm accustomed to overfitting causing scores to inflate,
>> >> but this pattern is not intuitive to me.
>> >>
>> >
>>
>
