spark-user mailing list archives

From iguana314 <grys...@gmail.com>
Subject How to name features and perform custom cross validation in ML
Date Mon, 21 Mar 2016 06:29:40 GMT
Hello,

I'm trying to run a simple linear regression in Spark ML. Below is my DataFrame,
along with some sample code and its output, run from Spyder on a local Spark
cluster.

*##########
#Begin Code
##########*
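from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

regressionDF.show(5)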
+-------+--------------------+
|  label|            features|
+-------+--------------------+
|59222.0|[1.49297445325996...|
|68212.0|[1.49297445325996...|
|68880.0|[1.49297445325996...|
|69307.0|[1.49297445325996...|
|81900.0|[1.49297445325996...|
+-------+--------------------+
only showing top 5 rows

lr2 = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr2.fit(regressionDF)

# Print the coefficients and intercept for linear regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))
Coefficients: [5942.44830928,-8073.06894164,81071.1787768]
Intercept: 48473.6291555

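numFolds = 10
evaluator = RegressionEvaluator(
    predictionCol="prediction", labelCol="label", metricName="rmse")
# Pass the unfitted estimator (lr2), not the fitted lrModel, so that each
# fold actually refits the regression.
pipeline = Pipeline(stages=[lr2])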

crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=ParamGridBuilder().build(),
    evaluator=evaluator,
    numFolds=numFolds)

CVModel = crossval.fit(regressionDF)
bestModel = CVModel.bestModel
cvPrediction = CVModel.transform(regressionDF).select("label", "prediction")

cvPrediction.show(5)
+-------+------------------+
|  label|        prediction|
+-------+------------------+
|59222.0|140493.58824997063|
|68212.0| 171442.7987818182|
|68880.0|135608.61939589953|
|69307.0|142447.57579159905|
|81900.0|135730.74361725134|
+-------+------------------+
only showing top 5 rows

*##########
#End Code
##########*

The code seems to work, but I'm unsure how to do the following:

*1)* I want to name every "column" of my feature vector, so that when the
coefficients are printed I can see which feature each one refers to.
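
To illustrate what I mean, here is a minimal sketch in plain Python that pairs
names with the coefficients printed above. The feature names and the helper
name_coefficients are hypothetical placeholders; only the three coefficient
values come from my output.

```python
def name_coefficients(feature_names, coefficients):
    """Pair each coefficient with a feature name, in order."""
    if len(feature_names) != len(coefficients):
        raise ValueError("need exactly one name per coefficient")
    return dict(zip(feature_names, coefficients))


# Hypothetical names; the real ones would be the input columns, in the
# same order in which they were assembled into the feature vector.
feature_names = ["feature_a", "feature_b", "feature_c"]

# Values copied from the printed output above. With Spark, something like
# lrModel.coefficients.toArray() could supply them instead.
coefficients = [5942.44830928, -8073.06894164, 81071.1787768]

named_coefficients = name_coefficients(feature_names, coefficients)
for name, value in named_coefficients.items():
    print("%s: %s" % (name, value))
```

This only works if I track the column order myself, which is what I would
like to avoid.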

*2)* How can I inspect the best model selected by cross-validation? I realize
this might not be applicable for something like random forests, but for linear
regression it would be quite useful.
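
Roughly, I imagine something like the sketch below. It assumes that bestModel
is a fitted pipeline exposing a stages list whose last entry is the fitted
regression, with the same coefficients and intercept attributes as lrModel
above. The helper name best_linear_model_params is made up for illustration.

```python
def best_linear_model_params(cv_model):
    """Return (coefficients, intercept) of the best model found by CV.

    Assumes cv_model.bestModel is either a fitted pipeline (with a
    'stages' list ending in the regression model) or the regression
    model itself.
    """
    best = cv_model.bestModel
    stages = getattr(best, "stages", None)
    fitted = stages[-1] if stages else best
    return fitted.coefficients, fitted.intercept


# Intended usage with the objects from the code above:
#   coefficients, intercept = best_linear_model_params(CVModel)
```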

*3)* Is there a way to restrict the range considered for an individual
feature's coefficient (for instance, I want feature 2 to have a coefficient no
larger than some number Beta_2)? I realize that Lasso constrains all the
coefficients together through the L1 norm, but what about a specific
feature?
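
To make the kind of constraint concrete, here is a small sketch outside Spark:
plain-Python projected gradient descent for least squares, where each
coefficient can be given an upper bound. It is only an illustration of the
idea, not something Spark provides as far as I can tell. The function name,
the toy data, and the omission of an intercept are all my own simplifying
assumptions.

```python
def fit_bounded_least_squares(X, y, upper_bounds,
                              learning_rate=0.05, iterations=1000):
    """Least squares without intercept, with optional per-coefficient
    upper bounds (None means unbounded), via projected gradient descent."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(iterations):
        # Residuals of the current fit: prediction minus label.
        residuals = [sum(b * x for b, x in zip(beta, row)) - target
                     for row, target in zip(X, y)]
        # Gradient of the mean squared error with respect to each coefficient.
        gradient = [2.0 / n * sum(r * row[j] for r, row in zip(residuals, X))
                    for j in range(d)]
        beta = [b - learning_rate * g for b, g in zip(beta, gradient)]
        # Projection step: clip each bounded coefficient back to its bound.
        beta = [b if ub is None else min(b, ub)
                for b, ub in zip(beta, upper_bounds)]
    return beta


# Toy data where the unconstrained answer is a coefficient of 3.
X = [[1.0], [2.0], [3.0]]
y = [3.0, 6.0, 9.0]
bounded = fit_bounded_least_squares(X, y, upper_bounds=[2.0])
unbounded = fit_bounded_least_squares(X, y, upper_bounds=[None])
print(bounded, unbounded)
```

With the bound in place the coefficient stops at the bound instead of reaching
the unconstrained fit, which is the behaviour I am after for a single feature.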

*4)* What's the easiest way to get something like R^2 or adjusted R^2? I
can code it manually, but is any of it built in?
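
For reference, the manual version I have in mind is sketched below using the
standard definitions, on plain Python lists rather than a DataFrame. The
function names and the small example lists are hypothetical; in practice the
labels and predictions would be collected from cvPrediction.

```python
def r_squared(labels, predictions):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / float(len(labels))
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_rows, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_rows - 1) / float(n_rows - n_features - 1)


# Made-up numbers, purely to show the call pattern.
example_labels = [1.0, 2.0, 3.0, 4.0]
example_predictions = [1.5, 1.5, 3.5, 3.5]
r2 = r_squared(example_labels, example_predictions)
print(r2, adjusted_r_squared(r2, n_rows=len(example_labels), n_features=1))
```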

Thank you for your help!



