Hi,
I have been trying to build a Decision Tree using a dataset that I have.
Dataset Description:
Train data size = 689,763
Test data size = 8,387,813
Each row in the dataset has 321 numerical values, of which the 139th is the
ground truth (leaving 320 features).
The number of positives in the dataset is low: 12,028.
There are no missing values in the dataset; this is ensured by
preprocessing.
The outcome against which we are building the tree is a binary variable
taking values 0 or 1.
Due to a few reasons, I am building a Regression Tree and not a
Classification Tree.
With 3 levels (maxDepth = 3), the tree is built quickly (a few minutes),
but it performs poorly. When I computed the correlation coefficient between
the ground-truth scores and the scores produced by the tree, I got a
correlation coefficient of 0.013140, which is very low.
Even inspecting individual predictions manually, the predictions are almost
the same, around 0.07 to 0.09, regardless of whether the particular row is
positive or negative in the ground truth.
When maxDepth is set to 5, the tree does not finish building even after
several hours.
When I include the ground truth in the training features, the tree is built
very quickly and the predictions are correct, with accuracy around 100%
(as expected).
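For reference, a minimal sketch of that sanity check, assuming the same `ssc`, `hdfsNN`, and parsing as in the code below; the only change is that the 139th value (index 138) is left inside the feature vector:

```scala
// Sanity check: the label stays inside the feature vector, so the tree
// can trivially recover it and accuracy should approach 100%.
val leakyRDD = ssc.textFile(hdfsNN + "Path to data /training/part*", 12)
  .map { st =>
    val parts = st.split(",").map(_.toDouble)
    // Vectors.dense(parts) keeps all 321 values, including the label.
    LabeledPoint(parts(138), Vectors.dense(parts))
  }
```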
So, I have two queries:
1) Why is the performance so poor with maxDepth = 3?
2) Why isn't building a regression decision tree feasible with maxDepth = 5?
Here is the core part of the code I am using :
    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.configuration.Algo.Regression
    import org.apache.spark.mllib.tree.impurity.Variance

    val ssc = new SparkContext(sparkMaster, "Spark exp 001", sparkHome, jars)

    // Parse a CSV line: the 139th value (index 138) is the label;
    // the remaining 320 values form the feature vector.
    val labelRDD = ssc.textFile(hdfsNN + "Path to data /training/part*", 12)
      .map { st =>
        val parts = st.split(",").map(_.toDouble)
        LabeledPoint(parts(138),
          Vectors.dense((parts take 138) ++ (parts drop 139)))
      }
    println(labelRDD.first)

    val model = DecisionTree.train(labelRDD, Regression, Variance, 3)

    val parsedData = ssc.textFile(hdfsNN + "Path to data /testing/part*", 12)
      .map { st =>
        val parts = st.split(",").map(_.toDouble)
        LabeledPoint(parts(138),
          Vectors.dense((parts take 138) ++ (parts drop 139)))
      }
    val labelAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    labelAndPreds.saveAsTextFile(hdfsNN + "Output path /labels")
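For completeness, here is a sketch of one way to compute the correlation coefficient quoted above from `labelAndPreds`, assuming a Spark version (1.1+) where `Statistics.corr` is available; this is not necessarily how I computed it:

```scala
import org.apache.spark.mllib.stat.Statistics

// Pearson correlation between ground-truth labels and tree predictions.
// Assumes labelAndPreds: RDD[(Double, Double)] from the snippet above.
val labels = labelAndPreds.map(_._1)
val preds  = labelAndPreds.map(_._2)
val corr   = Statistics.corr(labels, preds, "pearson")
println(s"Pearson correlation = $corr")
```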
When I build a Random Forest on the same dataset using Mahout, it builds
the forest in less than 5 minutes and gives good accuracy. The memory and
other resources available to Spark and to Mahout are comparable: Spark had
30 GB per worker * 3 workers = 90 GB in total.
Thanks and Regards,
Suraj Sheth
