spark-user mailing list archives

From Manish Maheshwari <mylogi...@gmail.com>
Subject LinearRegressionModel - Negative Predicted Value
Date Mon, 06 Mar 2017 08:05:26 GMT
Hi All,

We are using a LinearRegressionModel in Scala, and we normalize the data
with a StandardScaler before modelling. The code snippet looks like this:

*Modelling - *
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Drop String columns and convert the remaining columns to Double.
val labeledPointsRDD = tableRecords.map { row =>
  val filtered = row.toSeq.filter {
    case s: String => false
    case _ => true
  }
  val converted = filtered.map {
    case i: Int => i.toDouble
    case l: Long => l.toDouble
    case d: Double => d
    case _ => 0.0
  }
  // The first numeric column is the label; the rest are the features.
  val features = Vectors.dense(converted.slice(1, converted.length).toArray)
  LabeledPoint(converted(0), features)
}

// Fit the scaler on the features and persist it (save is a helper not shown here).
val scaler1 = new StandardScaler().fit(labeledPointsRDD.map(x => x.features))
save(sc, scalarModelOutputPath, scaler1)
val normalizedData = labeledPointsRDD.map(lp =>
  LabeledPoint(lp.label, scaler1.transform(lp.features)))

val splits = normalizedData.randomSplit(Array(0.8, 0.2))
val trainingData = splits(0)
val testingData = splits(1)
trainingData.cache()

val regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.01)
val model = regression.run(trainingData)
model.save(sc, modelOutputPath)

After that, we score the model on the same data it was trained on, using
the snippet below:

*Scoring - *
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

val labeledPointsRDD = tableRecords.map { row =>
  val filtered = row.toSeq.filter {
    case s: String => false
    case _ => true
  }
  val converted = filtered.map {
    case i: Int => i.toDouble
    case l: Long => l.toDouble
    case d: Double => d
    case _ => 0.0
  }
  // All numeric columns are used as features here; row(0) is kept as the key.
  val features = Vectors.dense(converted.toArray)
  (row(0), features)
}
// Load the saved scaler (read is a helper not shown here) and the saved model.
val scaler1 = read(sc, scalarModelOutputPath)
val normalizedData = labeledPointsRDD.map(p => (p._1, scaler1.transform(p._2)))
normalizedData.cache()
val model = LinearRegressionModel.load(sc, modelOutputPath)
val valuesAndPreds = normalizedData.map(p => (p._1.toString(), model.predict(p._2)))
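
The modelling snippet builds its features with converted.slice(1, ...), while
the scoring snippet uses converted.toArray. As a minimal sketch in plain Scala
(no Spark), the two constructions can be compared on a single row to see
whether they yield the same number of features. The row contents and the
object name FeatureLengthCheck are hypothetical and only for illustration; the
real table layout is not shown in this message.

```scala
object FeatureLengthCheck {
  // Same filter-and-convert logic as in both snippets:
  // drop String columns, turn Int/Long/Double into Double, anything else into 0.0.
  def toDoubles(row: Seq[Any]): Seq[Double] =
    row.filter {
      case _: String => false
      case _ => true
    }.map {
      case i: Int => i.toDouble
      case l: Long => l.toDouble
      case d: Double => d
      case _ => 0.0
    }

  // Modelling snippet: the first numeric column is the label and is dropped.
  def trainingFeatures(row: Seq[Any]): Seq[Double] = {
    val converted = toDoubles(row)
    converted.slice(1, converted.length)
  }

  // Scoring snippet: every numeric column is kept.
  def scoringFeatures(row: Seq[Any]): Seq[Double] = toDoubles(row)

  def main(args: Array[String]): Unit = {
    // Hypothetical row: numeric label, one String column, three numeric features.
    val row: Seq[Any] = Seq(10.0, "id-1", 1, 2L, 3.5)
    println(s"training features: ${trainingFeatures(row).length}")
    println(s"scoring features:  ${scoringFeatures(row).length}")
  }
}
```

If the two lengths differ for a real row, the vectors passed to
scaler1.transform and model.predict at scoring time do not line up with the
ones used for training.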

However, a lot of the predicted values are negative. The input data has no
negative values, so we are unable to understand this behaviour.
Furthermore, the order of the variables is the same in the modelling and
testing data frames.
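
For context, a linear regression prediction has the form intercept +
sum(w_i * x_i) and is not bounded below, so non-negative inputs alone do not
rule out negative outputs. The sketch below shows this in plain Scala with
made-up numbers. The weights, intercept, and object name are hypothetical and
are not taken from the fitted model.

```scala
object NegativePredictionSketch {
  // Linear prediction: intercept plus the dot product of weights and features.
  def predict(weights: Array[Double], intercept: Double, features: Array[Double]): Double = {
    require(weights.length == features.length, "weights and features must have the same length")
    intercept + weights.zip(features).map { case (w, x) => w * x }.sum
  }

  def main(args: Array[String]): Unit = {
    val weights = Array(2.0, -3.0)  // hypothetical: one negative weight
    val intercept = 1.0             // hypothetical
    val features = Array(1.0, 4.0)  // all non-negative
    println(predict(weights, intercept, features)) // 1.0 + 2.0 - 12.0 = -9.0
  }
}
```

Printing model.weights and model.intercept for the trained model would show
whether any negative coefficients are present.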

Any ideas?

Thanks,
Manish
