spark-user mailing list archives

From Yanbo Liang <yblia...@gmail.com>
Subject Re: Extremely poor predictive performance with RF in mllib
Date Tue, 04 Aug 2015 10:42:07 GMT
It looks like the predicted results are the exact opposite of what you
expect, so could you check whether the labels are correct?
Or could you share a few sample records that would help reproduce this output?
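
A minimal sketch of the label check suggested above, in plain Python with
made-up toy data (the names `label_counts` and `prediction_summary` are
illustrative, not part of MLlib). It counts each label value and
cross-tabulates true labels against predictions, which would show at a
glance whether the 1s and 0s are flipped or missing:

```python
from collections import Counter


def label_counts(labels):
    """Count how often each label value occurs."""
    return Counter(float(y) for y in labels)


def prediction_summary(labels, predictions):
    """Cross-tabulate (true label, predicted label) pairs."""
    return Counter((float(y), float(p)) for y, p in zip(labels, predictions))


# Hypothetical toy data standing in for the real labels and predictions.
labels = [1, 1, 1, 0, 0]
predictions = [0.0, 0.0, 0.0, 0.0, 0.0]

counts = label_counts(labels)
table = prediction_summary(labels, predictions)

print(counts)  # how many examples carry each label
print(table)   # how each true label was predicted
```

On an RDD of LabeledPoint objects such as `lp`, something like
`lp.map(lambda p: p.label).countByValue()` should give the same label
counts without collecting the data to the driver first.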

2015-08-03 19:36 GMT+08:00 Barak Gitsis <barakg@similarweb.com>:

> hi,
> I've run into some poor RF behavior too, although not as pronounced as yours.
> It would be great to get more insight into this one.
>
> Thanks!
>
> On Mon, Aug 3, 2015 at 8:21 AM pkphlam <pkphlam@gmail.com> wrote:
>
>> Hi,
>>
>> This might be a long shot, but has anybody run into very poor predictive
>> performance using RandomForest with Mllib? Here is what I'm doing:
>>
>> - Spark 1.4.1 with PySpark
>> - Python 3.4.2
>> - ~30,000 Tweets of text
>> - 12289 1s and 15956 0s
>> - Whitespace tokenization and then hashing trick for feature selection
>> using 10,000 features
>> - Run RF with 100 trees and maxDepth of 4 and then predict using the
>> features from all the 1s observations.
>>
>> So in theory, I should get predictions of close to 12289 1s (especially if
>> the model overfits). But I'm getting exactly 0 1s, which sounds ludicrous
>> to me and makes me suspect something is wrong with my code or I'm missing
>> something. I notice similar behavior (although not as extreme) if I play
>> around with the settings. But I'm getting normal behavior with other
>> classifiers, so I don't think it's my setup that's the problem.
>>
>> For example:
>>
>> >>> lrm = LogisticRegressionWithSGD.train(lp, iterations=10)
>> >>> logit_predict = lrm.predict(predict_feat)
>> >>> logit_predict.sum()
>> 9077
>>
>> >>> nb = NaiveBayes.train(lp)
>> >>> nb_predict = nb.predict(predict_feat)
>> >>> nb_predict.sum()
>> 10287.0
>>
>> >>> rf = RandomForest.trainClassifier(lp, numClasses=2,
>> ...     categoricalFeaturesInfo={}, numTrees=100, seed=422)
>> >>> rf_predict = rf.predict(predict_feat)
>> >>> rf_predict.sum()
>> 0.0
>>
>> This code was all run back to back so I didn't change anything in between.
>> Does anybody have a possible explanation for this?
>>
>> Thanks!
>>
>> --
> *-Barak*
>

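The feature step described in the quoted message (whitespace tokenization
followed by the hashing trick with 10,000 features) can be sketched in plain
Python as below. This is an illustration under stated assumptions, not the
original poster's code and not MLlib's implementation: the helper name
`hash_features` is hypothetical, and `zlib.crc32` is swapped in as the hash
function.

```python
import zlib

NUM_FEATURES = 10000  # matches the 10,000 features mentioned in the thread


def hash_features(text, num_features=NUM_FEATURES):
    """Map whitespace-separated tokens to a sparse {index: count} dict."""
    features = {}
    for token in text.split():
        # crc32 is deterministic across processes and runs (assumed choice).
        index = zlib.crc32(token.encode("utf-8")) % num_features
        features[index] = features.get(index, 0.0) + 1.0
    return features


# Hypothetical example text.
vec = hash_features("spark is fast and spark is fun")
print(vec)
```

The sketch avoids Python's built-in `hash()` on purpose: in Python 3, string
hashes are randomized per process unless PYTHONHASHSEED is fixed, so the same
token could land in different buckets in different processes. When comparing
classifiers, it is worth confirming that training and prediction features
were produced with a consistent hash.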