spark-user mailing list archives

From pkphlam
Subject Extremely poor predictive performance with RF in mllib
Date Mon, 03 Aug 2015 05:20:56 GMT

This might be a long shot, but has anybody run into very poor predictive
performance using RandomForest with MLlib? Here is what I'm doing:

- Spark 1.4.1 with PySpark
- Python 3.4.2
- ~30,000 tweets of text
- Labels: 12,289 1s and 15,956 0s
- Whitespace tokenization, then the hashing trick for feature extraction with
10,000 features
- Train RF with 100 trees and maxDepth of 4, then predict on the features of
all the observations labeled 1.
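
To make the feature step concrete, here is a minimal pure-Python sketch of whitespace tokenization followed by the hashing trick. It is only an illustration: `hash_features` and the sample tweet are hypothetical names, and `zlib.crc32` is a stand-in chosen because it is deterministic across runs. MLlib's own HashingTF uses a different hash function, so the indices will not match what Spark produces.

```python
import zlib
from collections import Counter

NUM_FEATURES = 10000  # feature count quoted in the post


def hash_features(text, num_features=NUM_FEATURES):
    """Map a text to a sparse {feature_index: count} dict via the hashing trick."""
    counts = Counter()
    for token in text.split():  # whitespace tokenization
        # Assumption: crc32 stands in for whatever hash the real pipeline uses.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


# Hypothetical sample input, not taken from the original data set.
tweet = "spark mllib random forest spark"
features = hash_features(tweet)
```

Each tweet touches only a handful of the 10,000 indices, so the resulting vectors are very sparse, and distinct tokens can collide on the same index.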

So in theory, I should get close to 12,289 predictions of 1 (especially if
the model overfits). But I'm getting exactly zero 1s, which sounds ludicrous
to me and makes me suspect that something is wrong with my code or that I'm
missing something. I see similar behavior (although not as extreme) when I
play around with the settings. I get normal behavior with other classifiers,
though, so I don't think my setup is the problem.
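
For context on the numbers above, the following sketch works out the class proportions from the counts in the post and defines a small helper for the share of 1s in a list of predictions. The names `positive_rate` and `majority_baseline_accuracy` are hypothetical, and this is only a sanity check on the arithmetic, not a diagnosis of the RandomForest result.

```python
# Class counts as stated in the post.
n_pos, n_neg = 12289, 15956
total = n_pos + n_neg  # 28245 labeled tweets

# Share of observations labeled 1.
prior_pos = n_pos / total

# Accuracy of a trivial model that always predicts the larger class (0 here).
majority_baseline_accuracy = max(n_pos, n_neg) / total


def positive_rate(predictions):
    """Fraction of predictions equal to 1, for a sequence of 0/1 values."""
    predictions = list(predictions)
    return sum(predictions) / len(predictions) if predictions else 0.0
```

Applied to the predictions collected for the observations labeled 1, `positive_rate` is the model's recall on that class. A model that only ever outputs 0 scores 0.0 on it while still matching the majority baseline on the full data set.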

For example:

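>>> from pyspark.mllib.classification import LogisticRegressionWithSGD, NaiveBayes
>>> from pyspark.mllib.tree import RandomForest
>>> lrm = LogisticRegressionWithSGD.train(lp, iterations=10)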
>>> logit_predict = lrm.predict(predict_feat)
>>> logit_predict.sum()

>>> nb = NaiveBayes.train(lp)
>>> nb_predict = nb.predict(predict_feat)
>>> nb_predict.sum()

>>> rf = RandomForest.trainClassifier(lp, numClasses=2,
...     categoricalFeaturesInfo={}, numTrees=100, seed=422)
>>> rf_predict = rf.predict(predict_feat)
>>> rf_predict.sum()

This code was all run back to back, so I didn't change anything in between.
Does anybody have a possible explanation for this?

