Yes, I rechecked and the label is correct. As you can see in the code posted, I used the exact same features for all the classifiers so unless rf somehow switches the labels, it should be correct.

I have posted a sample dataset and sample code to reproduce what I'm getting here:

https://github.com/pkphlam/spark_rfpredict

On Tue, Aug 4, 2015 at 6:42 AM, Yanbo Liang <ybliang8@gmail.com> wrote:
It looks like the predicted results are the opposite of what you expect, so could you check whether the labels are right?
Or could you share some sample data that would help reproduce this output?

2015-08-03 19:36 GMT+08:00 Barak Gitsis <barakg@similarweb.com>:
hi,
I've run into some poor RF behavior too, although not as pronounced as yours. It would be great to get more insight into this one.

Thanks!

On Mon, Aug 3, 2015 at 8:21 AM pkphlam <pkphlam@gmail.com> wrote:
Hi,

This might be a long shot, but has anybody run into very poor predictive
performance using RandomForest with Mllib? Here is what I'm doing:

- Spark 1.4.1 with PySpark
- Python 3.4.2
- ~30,000 Tweets of text
- 12289 1s and 15956 0s
- Whitespace tokenization and then the hashing trick for feature extraction,
using 10,000 features
- Run RF with 100 trees and maxDepth of 4 and then predict using the
features from all the 1s observations.
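
[Editor's sketch] The post does not show the feature code, so the following is only a minimal pure-Python illustration of whitespace tokenization followed by the hashing trick, under the assumption that each token is hashed into one of 10,000 buckets and counted. The function names and the choice of CRC32 as the hash are hypothetical; in MLlib the same step would more likely be done with HashingTF.

```python
import zlib
from collections import Counter

NUM_FEATURES = 10000  # bucket count mentioned in the post


def tokenize(text):
    """Split on whitespace, as described in the post."""
    return text.split()


def hash_features(tokens, num_features=NUM_FEATURES):
    """Map tokens to a sparse {bucket_index: count} dictionary.

    CRC32 is used here only because it is deterministic across runs;
    it is an illustrative choice, not what the poster or MLlib uses.
    """
    counts = Counter()
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[idx] += 1
    return dict(counts)


# Hypothetical example tweet
vec = hash_features(tokenize("spark rf predicts zero ones spark"))
```

Each tweet ends up as a very sparse vector: a handful of nonzero buckets out of 10,000, which matters when the trees are only allowed a few splits.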

So in theory, I should get predictions of close to 12289 1s (especially if
the model overfits). But I'm getting exactly 0 1s, which sounds ludicrous to
me and makes me suspect something is wrong with my code or I'm missing
something. I notice similar behavior (although not as extreme) if I play
around with the settings. But I'm getting normal behavior with other
classifiers, so I don't think it's my setup that's the problem.

For example:

>>> lrm = LogisticRegressionWithSGD.train(lp, iterations=10)
>>> logit_predict = lrm.predict(predict_feat)
>>> logit_predict.sum()
9077

>>> nb = NaiveBayes.train(lp)
>>> nb_predict = nb.predict(predict_feat)
>>> nb_predict.sum()
10287.0

>>> rf = RandomForest.trainClassifier(lp, numClasses=2,
...     categoricalFeaturesInfo={}, numTrees=100, seed=422)
>>> rf_predict = rf.predict(predict_feat)
>>> rf_predict.sum()
0.0

This code was all run back to back so I didn't change anything in between.
Does anybody have a possible explanation for this?
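
[Editor's sketch] One possible explanation, not confirmed anywhere in this thread: the 1s are the minority class (12289 of 28245, about 43.5%), and a depth-4 tree can split on at most 15 of the 10,000 sparse features, so most leaves may sit close to the base rate and predict 0. If the forest then combines trees by hard majority vote (an assumption about the implementation), every observation can come out as 0 even when some trees are confident. The toy below uses made-up leaf fractions purely to show that effect; it does not model the actual data.

```python
# Class balance taken from the post
n_pos, n_neg = 12289, 15956
base_rate = n_pos / (n_pos + n_neg)  # fraction of 1s in the training data


def hard_vote(leaf_pos_fractions):
    """Each tree votes 1 only if its leaf is majority-positive;
    the forest returns the majority of those hard votes."""
    votes = [1 if p > 0.5 else 0 for p in leaf_pos_fractions]
    return 1 if sum(votes) > len(votes) / 2 else 0


def soft_vote(leaf_pos_fractions):
    """Average the per-tree positive fractions and threshold once."""
    mean = sum(leaf_pos_fractions) / len(leaf_pos_fractions)
    return 1 if mean > 0.5 else 0


# Hypothetical observation: 40 trees route it to a strongly positive leaf,
# 60 trees route it to a leaf near the base rate.
fractions = [0.9] * 40 + [0.45] * 60

hard = hard_vote(fractions)  # the 60 near-base-rate trees outvote the rest
soft = soft_vote(fractions)  # the averaged fraction is above 0.5
```

If this is what is happening, deeper trees or a rebalanced training set would be expected to change the result, which matches the less extreme behavior seen when playing with the settings.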

Thanks!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Extremely-poor-predictive-performance-with-RF-in-mllib-tp24112.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


--
-Barak




--
Patrick Lam
Institute for Quantitative Social Science, Harvard University
http://www.patricklam.org