spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Manish Tripathi <tr.man...@gmail.com>
Subject Spark Float to VectorUDT for ML evaluator lib
Date Fri, 04 Nov 2016 20:51:02 GMT
Hi

I am trying to run the ML Binary Evaluation Classifier metrics to compare
the rating with predicted values and get the AreaROC.

My dataframe has two columns with rating as int (I have binarized it) and
predicitions which is a float.

When I pass it to the ML evaluator method I get an error as shown below:

Can someone help me with gettng this sorted out?. Appreciate all the help

Stackoverflow post:

http://stackoverflow.com/questions/40408898/converting-the-float-column-in-spark-dataframe-to-vectorudt


I was trying to use the pyspark.ml.evaluation Binary classification metric
like below

evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")

print evaluator.evaluate(predictions)

My Predictions data frame looks like this:

predictions.select('rating','prediction')

predictions.show()

+------+------------+

|rating|  prediction|

+------+------------+

|     1|  0.14829934|

|     1|-0.017862909|

|     1|   0.4951505|

|     1|0.0074382657|

|     1|-0.002562912|

|     1|   0.0208337|

|     1| 0.049362548|

|     1|  0.09693333|

|     1|  0.17998546|

|     1| 0.019649783|

|     1| 0.031353004|

|     1|  0.03657037|

|     1|  0.23280995|

|     1| 0.033190556|

|     1|  0.35569906|

|     1| 0.030974165|

|     1|   0.1422375|

|     1|  0.19786166|

|     1|  0.07740938|

|     1|  0.33970386|

+------+------------+

only showing top 20 rows

The datatype of each column is as follows:

predictions.printSchema()

root

 |-- rating: integer (nullable = true)

 |-- prediction: float (nullable = true)

Now I get an error with above Ml code saying prediction column is Float and
expected a VectorUDT.

/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in
__call__(self, *args)

    811         answer = self.gateway_client.send_command(command)

    812         return_value = get_return_value(

--> 813             answer, self.gateway_client, self.target_id, self.name)

    814

    815         for temp_arg in temp_args:


/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)

     51                 raise AnalysisException(s.split(': ', 1)[1],
stackTrace)

     52             if s.startswith('java.lang.IllegalArgumentException: '):

---> 53                 raise IllegalArgumentException(s.split(': ', 1)[1],
stackTrace)

     54             raise

     55     return deco


IllegalArgumentException: u'requirement failed: Column prediction must be
of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually
FloatType.'

So I thought of converting the predictions column from float to VectorUDT
as below:

*Applying the schema to the dataframe to convert the float column type to
VectorUDT*

from pyspark.sql.types import IntegerType, StructType,StructField


schema = StructType([

    StructField("rating", IntegerType, True),

    StructField("prediction", VectorUDT(), True)

])



predictions_dtype=sqlContext.createDataFrame(prediction,schema)

But Now I get this error.

---------------------------------------------------------------------------

AssertionError                            Traceback (most recent call last)

<ipython-input-30-8fce6c4bbeb4> in <module>()

      4

      5 schema = StructType([

----> 6     StructField("rating", IntegerType, True),

      7     StructField("prediction", VectorUDT(), True)

      8 ])


/Users/i854319/spark/python/pyspark/sql/types.pyc in __init__(self, name,
dataType, nullable, metadata)

    401         False

    402         """

--> 403         assert isinstance(dataType, DataType), "dataType should be
DataType"

    404         if not isinstance(name, str):

    405             name = name.encode('utf-8')


AssertionError: dataType should be DataType

It takes so much time to run an ml algo in spark libraries with so many
weird errors. Even I tried Mllib with RDD data. That is giving the
ValueError: Null pointer exception.



ᐧ

Mime
View raw message