spark-user mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame
Date Thu, 02 Apr 2015 07:05:18 GMT
I reproduced the bug on master and submitted a patch for it:
https://github.com/apache/spark/pull/5329. It may get into Spark
1.3.1. Thanks for reporting the bug! -Xiangrui
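
[Editor's note: until the patched release is available, one workaround consistent with this thread is to avoid `createDataFrame(rowRDD, schema)` for Rows holding raw UDT values and instead rebuild typed objects, then use the implicit `toDF` path, which the thread reports as saving correctly. This is a hedged sketch against the Spark 1.3-era mllib/sql API; the object name and file path are illustrative only.]

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

object UdtSaveWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UdtSaveWorkaround").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Rows holding raw UDT instances, as in the failing example quoted below.
    val rows = sc.parallelize(Seq(Row(1.0, Vectors.zeros(10))))

    // Instead of sqlContext.createDataFrame(rows, schema), rebuild the typed
    // objects; the implicit toDF path serializes the mllib.Vector UDT correctly.
    val typed = rows.map { case Row(l: Double, f: Vector) => LabeledPoint(l, f) }
    typed.toDF.save("workaround.parquet")
  }
}
```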

On Wed, Apr 1, 2015 at 12:57 AM, Jaonary Rabarisoa <jaonary@gmail.com> wrote:
> Hmm, I got the same error with the master. Here is another test example that
> fails; this time I explicitly create a Row RDD, which corresponds to my use case:
>
> object TestDataFrame {
>
>   def main(args: Array[String]): Unit = {
>
>     val conf = new
> SparkConf().setAppName("TestDataFrame").setMaster("local[4]")
>     val sc = new SparkContext(conf)
>     val sqlContext = new SQLContext(sc)
>
>     import sqlContext.implicits._
>
>     val data = Seq(LabeledPoint(1, Vectors.zeros(10)))
>     val dataDF = sc.parallelize(data).toDF
>
>     dataDF.printSchema()
>     dataDF.save("test1.parquet") // OK
>
>     val dataRow = data.map { case LabeledPoint(l: Double, f: mllib.linalg.Vector) =>
>       Row(l, f)
>     }
>
>     val dataRowRDD = sc.parallelize(dataRow)
>     val dataDF2 = sqlContext.createDataFrame(dataRowRDD, dataDF.schema)
>
>     dataDF2.printSchema()
>
>     dataDF2.saveAsParquetFile("test3.parquet") // FAIL !!!
>   }
> }
>
>
> On Tue, Mar 31, 2015 at 11:18 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>
>> I cannot reproduce this error on master, but I'm not aware of any
>> recent bug fixes that are related. Could you build and try the current
>> master? -Xiangrui
>>
>> On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa <jaonary@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > A DataFrame with a user-defined type (here mllib.Vector) created with
>> > sqlContext.createDataFrame can't be saved to a Parquet file and raises a
>> > "ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be
>> > cast to org.apache.spark.sql.Row" error.
>> >
>> > Here is an example of code to reproduce this error:
>> >
>> > object TestDataFrame {
>> >
>> >   def main(args: Array[String]): Unit = {
>> >     //System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
>> >     val conf = new SparkConf().setAppName("RankingEval").setMaster("local[8]")
>> >       .set("spark.executor.memory", "6g")
>> >
>> >     val sc = new SparkContext(conf)
>> >     val sqlContext = new SQLContext(sc)
>> >
>> >     import sqlContext.implicits._
>> >
>> >     val data = sc.parallelize(Seq(LabeledPoint(1, Vectors.zeros(10))))
>> >     val dataDF = data.toDF
>> >
>> >     dataDF.save("test1.parquet")
>> >
>> >     val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)
>> >
>> >     dataDF2.save("test2.parquet")
>> >   }
>> > }
>> >
>> >
>> > Is this related to https://issues.apache.org/jira/browse/SPARK-5532, and
>> > how can it be solved?
>> >
>> >
>> > Cheers,
>> >
>> >
>> > Jao
>
>

