spark-user mailing list archives

From Zhiliang Zhu <>
Subject spark ml : auc on extreme distributed data
Date Mon, 15 Aug 2016 04:11:39 GMT
Hi All, 
I have a dataset of around 1,000,000 rows; 97% of them are negative class and 3% of them
are positive class. I applied the Random Forest algorithm to build a model and predict on
the testing data.
For the data preparation:
i. first, randomly split all the data into training data and testing data at 0.7 : 0.3
ii. leave the testing data unchanged, so its negative to positive class ratio is still 97:3
iii. bring the training data's negative to positive class ratio to 50:50 by sampling
within each class
iv. train the RF model on the training data and predict on the testing data
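The preparation steps above can be sketched roughly as follows. This is a minimal plain-Python illustration, not the actual Spark job: the `split_and_balance` helper, the synthetic `rows`, and the choice of undersampling the majority class (rather than oversampling the minority) are all assumptions made for the example.

```python
import random

def split_and_balance(rows, label_of, train_fraction=0.7, seed=42):
    """Split rows into train/test, then balance the classes in train only.

    Hypothetical helper: balancing is done by undersampling the majority
    class, which is only one of several possible sampling strategies.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)

    # Step i: random 0.7 : 0.3 split.
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]

    # Step iii: make the training set 50:50 by sampling the larger class
    # down to the size of the smaller one.
    pos = [r for r in train if label_of(r) == 1]
    neg = [r for r in train if label_of(r) == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)

    # Step ii: the test set is returned untouched, so it keeps roughly
    # the original class ratio.
    return balanced, test

# Synthetic stand-in data: (id, label) pairs with 3% positives.
rows = [(i, 1 if i % 100 < 3 else 0) for i in range(100_000)]
train, test = split_and_balance(rows, label_of=lambda r: r[1])

train_pos = sum(label for _, label in train)
test_pos_ratio = sum(label for _, label in test) / len(test)
```

Here `train` ends up with equal numbers of each class, while `test_pos_ratio` stays close to 0.03.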
After tuning the algorithm parameters and doing feature work (PCA etc.), it seems that the
AUC on the testing data is always above 0.8, or much higher ...
Then I got confused... It seems that the model, or the AUC, depends a lot on the original
data distribution... In effect, I would like to know: for this data distribution, what would
the AUC be for random guessing? And what would the AUC be for any kind of data distribution?
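One way to probe the random-guess question empirically is sketched below. It assumes synthetic labels with a 3% positive rate and scores drawn uniformly at random. The `auc` helper is a hypothetical illustration based on the rank-sum (Mann-Whitney) formulation, not Spark's evaluator. Since random scores carry no information about the label, the value should land near 0.5 whatever the class ratio is.

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    Tied scores receive the average of the ranks they span.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of equal scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# Synthetic test set: about 3% positives, scores are pure noise.
rng = random.Random(0)
labels = [1 if rng.random() < 0.03 else 0 for _ in range(100_000)]
random_scores = [rng.random() for _ in labels]
random_auc = auc(labels, random_scores)
```

Changing the 0.03 positive rate to some other ratio and re-running gives a quick check of how little the random-guess value moves with the class balance.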
Thanks in advance~~