spark-user mailing list archives

From Jaonary Rabarisoa <jaon...@gmail.com>
Subject Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame
Date Fri, 03 Apr 2015 08:43:05 GMT
Good! Thank you.

On Thu, Apr 2, 2015 at 9:05 AM, Xiangrui Meng <mengxr@gmail.com> wrote:

> I reproduced the bug on master and submitted a patch for it:
> https://github.com/apache/spark/pull/5329. It may get into Spark
> 1.3.1. Thanks for reporting the bug! -Xiangrui
>
> On Wed, Apr 1, 2015 at 12:57 AM, Jaonary Rabarisoa <jaonary@gmail.com>
> wrote:
> > Hmm, I got the same error with master. Here is another test example that
> > fails. Here, I explicitly create a Row RDD, which corresponds to my use
> > case:
> >
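> > import org.apache.spark.{SparkConf, SparkContext}
> > import org.apache.spark.mllib
> > import org.apache.spark.mllib.linalg.Vectors
> > import org.apache.spark.mllib.regression.LabeledPoint
> > import org.apache.spark.sql.{Row, SQLContext}
> >
> > object TestDataFrame {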
> >
> >   def main(args: Array[String]): Unit = {
> >
> >     val conf = new SparkConf().setAppName("TestDataFrame").setMaster("local[4]")
> >     val sc = new SparkContext(conf)
> >     val sqlContext = new SQLContext(sc)
> >
> >     import sqlContext.implicits._
> >
> >     val data = Seq(LabeledPoint(1, Vectors.zeros(10)))
> >     val dataDF = sc.parallelize(data).toDF
> >
> >     dataDF.printSchema()
> >     dataDF.save("test1.parquet") // OK
> >
> >     val dataRow = data.map { case LabeledPoint(l: Double, f: mllib.linalg.Vector) =>
> >       Row(l,f)
> >     }
> >
> >     val dataRowRDD = sc.parallelize(dataRow)
> >     val dataDF2 = sqlContext.createDataFrame(dataRowRDD, dataDF.schema)
> >
> >     dataDF2.printSchema()
> >
> >     dataDF2.saveAsParquetFile("test3.parquet") // FAIL !!!
> >   }
> > }
> >
> >
> > On Tue, Mar 31, 2015 at 11:18 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
> >>
> >> I cannot reproduce this error on master, but I'm not aware of any
> >> recent bug fixes that are related. Could you build and try the current
> >> master? -Xiangrui
> >>
> >> On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa <jaonary@gmail.com>
> >> wrote:
> >> > Hi all,
> >> >
> >> > A DataFrame with a user-defined type (here mllib.Vector) created with
> >> > sqlContext.createDataFrame can't be saved to a Parquet file. It raises
> >> > a ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot
> >> > be cast to org.apache.spark.sql.Row.
> >> >
> >> > Here is example code that reproduces this error:
> >> >
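> >> > import org.apache.spark.{SparkConf, SparkContext}
> >> > import org.apache.spark.mllib.linalg.Vectors
> >> > import org.apache.spark.mllib.regression.LabeledPoint
> >> > import org.apache.spark.sql.SQLContext
> >> >
> >> > object TestDataFrame {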
> >> >
> >> >   def main(args: Array[String]): Unit = {
> >> >     //System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
> >> >     val conf = new SparkConf().setAppName("RankingEval").setMaster("local[8]")
> >> >       .set("spark.executor.memory", "6g")
> >> >
> >> >     val sc = new SparkContext(conf)
> >> >     val sqlContext = new SQLContext(sc)
> >> >
> >> >     import sqlContext.implicits._
> >> >
> >> >     val data = sc.parallelize(Seq(LabeledPoint(1, Vectors.zeros(10))))
> >> >     val dataDF = data.toDF
> >> >
> >> >     dataDF.save("test1.parquet")
> >> >
> >> >     val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)
> >> >
> >> >     dataDF2.save("test2.parquet")
> >> >   }
> >> > }
> >> >
> >> >
> >> > Is this related to https://issues.apache.org/jira/browse/SPARK-5532,
> >> > and how can it be solved?
> >> >
> >> >
> >> > Cheers,
> >> >
> >> >
> >> > Jao
> >
> >
>
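
For readers trying to picture the failure, here is a toy sketch of the kind
of cast error quoted above ("DenseVector cannot be cast to
org.apache.spark.sql.Row"). It assumes, without confirmation from the thread,
that the save path expects a user-defined-type column to already be in a
serialized row form. ToyRow, ToyVector, ToyVectorUDT and ToyWriter are
hypothetical stand-ins invented for illustration. They are not Spark classes,
and this is not the code changed in the patch.

```scala
// Toy model only: every type here is a hypothetical stand-in, not a Spark class.
final case class ToyRow(values: Seq[Any])
final case class ToyVector(values: Array[Double])

// A user-defined type converts its user-facing object into the
// row form that the storage layer is assumed to expect.
object ToyVectorUDT {
  def serialize(v: ToyVector): ToyRow = ToyRow(v.values.toSeq)
}

object ToyWriter {
  // The writer assumes the cell already holds the serialized row form
  // and casts without checking.
  def writeVectorColumn(cell: Any): Int =
    cell.asInstanceOf[ToyRow].values.length
}

object UdtCastDemo {
  def main(args: Array[String]): Unit = {
    val v = ToyVector(Array.fill(10)(0.0))

    // Path 1: the vector is serialized before it reaches the writer.
    val converted = ToyRow(Seq(1.0, ToyVectorUDT.serialize(v)))
    println(ToyWriter.writeVectorColumn(converted.values(1)))

    // Path 2: the raw vector object is placed in the row unconverted,
    // so the writer's cast fails.
    val raw = ToyRow(Seq(1.0, v))
    try {
      ToyWriter.writeVectorColumn(raw.values(1))
    } catch {
      case e: ClassCastException =>
        println("ClassCastException: " + e.getMessage)
    }
  }
}
```

In this toy model the first path succeeds and the second throws, which
mirrors the difference between the toDF and createDataFrame cases reported
in the thread. Whether Spark's internals follow the same shape is an
assumption here.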
