spark-user mailing list archives

From Patrick Lam <pkph...@gmail.com>
Subject Re: Extremely poor predictive performance with RF in mllib
Date Tue, 04 Aug 2015 17:34:08 GMT
Yes, I rechecked and the label is correct. As you can see in the code
posted, I used the exact same features for all the classifiers so unless rf
somehow switches the labels, it should be correct.
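
To put the counts from the quoted message below in perspective, here is a
small sketch that turns them into the share of true 1s each classifier
recovers. The counts (12289 positives; 9077, 10287 and 0 predicted 1s) come
from that message; the variable names are only illustrative.

```python
# Number of observations whose true label is 1 (from the original message).
n_positive = 12289

# Predicted 1s per classifier when predicting on the true-1 rows only.
predicted_ones = {
    "logistic regression": 9077,
    "naive Bayes": 10287,
    "random forest": 0,
}

# Share of true 1s that each classifier labels as 1 (recall on the 1 class).
shares = {name: count / n_positive for name, count in predicted_ones.items()}

for name, share in shares.items():
    print("{}: {:.1%}".format(name, share))
```

A share of exactly zero is what stands out here: even a weak model that is
not flipping labels would be expected to recover some of the 1s.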

I have posted a sample dataset and sample code to reproduce what I'm
getting here:

https://github.com/pkphlam/spark_rfpredict
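
For anyone who does not want to open the repository, the featurization step
described in the quoted message (whitespace tokenization followed by the
hashing trick with 10,000 features) can be sketched roughly as below. This is
a pure-Python illustration, not the code from the repository: the function
name `hash_features` is made up, and md5 is an assumption chosen only because
it is stable across runs. MLlib's own hashing uses a different hash function,
so the indices will not match.

```python
import hashlib

NUM_FEATURES = 10000  # the 10,000 features mentioned in the thread


def hash_features(text, num_features=NUM_FEATURES):
    """Map whitespace tokens to a sparse {index: count} term-frequency dict."""
    counts = {}
    for token in text.split():  # whitespace tokenization
        # Assumption: md5 stands in for the hash function, purely so that
        # the same token always lands in the same bucket.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % num_features
        counts[index] = counts.get(index, 0) + 1
    return counts


example = hash_features("spark rf rf predicts zero")
print(sorted(example.items()))
```

Each tweet ends up as a sparse vector of length 10,000. Distinct tokens can
collide in the same bucket, which is the usual trade-off of the hashing trick.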

On Tue, Aug 4, 2015 at 6:42 AM, Yanbo Liang <ybliang8@gmail.com> wrote:

> It looks like the predicted result is just the opposite of what is
> expected, so could you check whether the label is right?
> Or could you share some sample data that can help reproduce this output?
>
> 2015-08-03 19:36 GMT+08:00 Barak Gitsis <barakg@similarweb.com>:
>
>> Hi,
>> I've run into some poor RF behavior too, although not as pronounced as
>> yours. It would be great to get more insight into this one.
>>
>> Thanks!
>>
>> On Mon, Aug 3, 2015 at 8:21 AM pkphlam <pkphlam@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> This might be a long shot, but has anybody run into very poor predictive
>>> performance using RandomForest with MLlib? Here is what I'm doing:
>>>
>>> - Spark 1.4.1 with PySpark
>>> - Python 3.4.2
>>> - ~30,000 Tweets of text
>>> - 12289 1s and 15956 0s
>>> - Whitespace tokenization and then hashing trick for feature selection
>>>   using 10,000 features
>>> - Run RF with 100 trees and maxDepth of 4 and then predict using the
>>>   features from all the 1s observations.
>>>
>>> So in theory, I should get predictions of close to 12289 1s (especially
>>> if the model overfits). But I'm getting exactly 0 1s, which sounds
>>> ludicrous to me and makes me suspect something is wrong with my code or
>>> I'm missing something. I notice similar behavior (although not as
>>> extreme) if I play around with the settings. But I'm getting normal
>>> behavior with other classifiers, so I don't think it's my setup that's
>>> the problem.
>>>
>>> For example:
>>>
>>> >>> lrm = LogisticRegressionWithSGD.train(lp, iterations=10)
>>> >>> logit_predict = lrm.predict(predict_feat)
>>> >>> logit_predict.sum()
>>> 9077
>>>
>>> >>> nb = NaiveBayes.train(lp)
>>> >>> nb_predict = nb.predict(predict_feat)
>>> >>> nb_predict.sum()
>>> 10287.0
>>>
>>> >>> rf = RandomForest.trainClassifier(lp, numClasses=2,
>>> ...     categoricalFeaturesInfo={}, numTrees=100, seed=422)
>>> >>> rf_predict = rf.predict(predict_feat)
>>> >>> rf_predict.sum()
>>> 0.0
>>>
>>> This code was all run back to back so I didn't change anything in
>>> between. Does anybody have a possible explanation for this?
>>>
>>> Thanks!
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Extremely-poor-predictive-performance-with-RF-in-mllib-tp24112.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> --
>> *-Barak*
>>
>
>


-- 
Patrick Lam
Institute for Quantitative Social Science, Harvard University
http://www.patricklam.org
