spark-user mailing list archives

From Suraj Nayak <snay...@gmail.com>
Subject SPARK-25959 - Difference in featureImportances results on computed vs saved models
Date Wed, 07 Nov 2018 03:04:00 GMT
Hi Spark Users,

I trained a GBT classifier and found that the feature importances computed
on the freshly fitted model differ from those reported after the same model
is saved to storage and loaded back.



I also found that once the persisted model is loaded, saved again, and
loaded a second time, the feature importances stay the same as after the
first load.



I am not sure whether this is a bug in writing or reading the model the
first time, or whether I am missing a parameter that needs to be set before
saving (so that the model picks up some defaults on load, causing the
feature importances to change).



*Below is the test code:*

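import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.ml.feature.VectorAssembler
// toDF needs spark.implicits._, which spark-shell imports automatically

val testDF = Seq(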
(1, 3, 2, 1, 1),
(3, 2, 1, 2, 0),
(2, 2, 1, 1, 0),
(3, 4, 2, 2, 0),
(2, 2, 1, 3, 1)
).toDF("a", "b", "c", "d", "e")


val featureColumns = testDF.columns.filter(_ != "e")
// Assemble the features into a vector
val assembler = new VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
// Transform the data to get the feature data set
val featureDF = assembler.transform(testDF)

// Train a GBT model.
val gbt = new GBTClassifier()
.setLabelCol("e")
.setFeaturesCol("features")
.setMaxDepth(2)
.setMaxBins(5)
.setMaxIter(10)
.setSeed(10)
.fit(featureDF)

gbt.transform(featureDF).show(false)

// Print the feature importances of the freshly fitted model
featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
/* Prints

(d,0.5931875075767403)
(a,0.3747184548362353)
(b,0.03209403758702444)
(c,0.0)

*/

// Write out the model
gbt.write.overwrite().save("file:///tmp/test123")

println("Reading model again")
val gbtload = GBTClassificationModel.load("file:///tmp/test123")

featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)

/*

Prints

(d,0.6455841215290767)
(a,0.3316126797964181)
(b,0.022803198674505094)
(c,0.0)

*/


gbtload.write.overwrite().save("file:///tmp/test123_rewrite")

val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")

featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)

/* prints
(d,0.6455841215290767)
(a,0.3316126797964181)
(b,0.022803198674505094)
(c,0.0)

*/
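
To put a number on the discrepancy, the three printouts above can be compared directly. The following is only a rough sketch in plain Scala: `maxAbsDiff` is a hypothetical helper written for this illustration (it is not part of Spark), and the arrays are the values copied from the printouts, in the order d, a, b, c.

```scala
// Hypothetical helper, not part of Spark: largest absolute per-feature gap
// between two importance vectors of the same length.
def maxAbsDiff(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "importance vectors must have the same length")
  a.zip(b).map { case (x, y) => math.abs(x - y) }.foldLeft(0.0)((m, d) => math.max(m, d))
}

// Values copied from the printouts above, in the order d, a, b, c.
val fitted   = Array(0.5931875075767403, 0.3747184548362353, 0.03209403758702444, 0.0)
val reloaded = Array(0.6455841215290767, 0.3316126797964181, 0.022803198674505094, 0.0)
val resaved  = Array(0.6455841215290767, 0.3316126797964181, 0.022803198674505094, 0.0)

println(maxAbsDiff(fitted, reloaded))   // about 0.0524, from feature d
println(maxAbsDiff(reloaded, resaved))  // 0.0
```

With the models from the test code, the same helper could presumably be called as maxAbsDiff(gbt.featureImportances.toArray, gbtload.featureImportances.toArray), since both vectors follow the order of featureColumns.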

Any help in making sure the feature importances are preserved as-is when
the model is first stored would be appreciated.

Thanks!
