spark-user mailing list archives

From Yanbo Liang <yblia...@gmail.com>
Subject Re: Saving and Loading Dataframes
Date Mon, 29 Feb 2016 03:28:55 GMT
Hi Raj,

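If you choose JSON as the storage format, Spark SQL writes the VectorUDT
column as plain JSON data, so the type information is lost.
When you load the data back, the column is read as an ordinary struct (as in
the schema you posted) and is no longer recognized as a Vector.
One workaround is to store the DataFrame in Parquet format, which keeps the
type, so the column is loaded and recognized as expected: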

df.write.format("parquet").mode("overwrite").save(output)
val data = sqlContext.read.format("parquet").load(output)
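
For reference, the round trip above could be sketched as a small standalone program. This is only a sketch: it assumes the SQLContext-style API and the "libsvm" data source used in the IOTest example quoted below, the input path is taken from that example, and the object name and the output directory are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical object name; not part of the original thread.
object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    // Input path taken from the quoted IOTest example; adjust as needed.
    val inputFile = "/tmp/sample_libsvm_data.txt"
    // Hypothetical output directory for the Parquet files.
    val outputDir = "/tmp/out_parquet"

    val conf = new SparkConf().setAppName("parquet-round-trip").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Load the vectorized data: a "label" column and a vector "features" column.
    val df = sqlContext.read.format("libsvm").load(inputFile)

    // Write as Parquet instead of JSON, then read it back.
    df.write.format("parquet").mode("overwrite").save(outputDir)
    val data = sqlContext.read.format("parquet").load(outputDir)

    // Print both schemas so they can be compared by eye.
    df.printSchema()
    data.printSchema()

    // Compare the data type of the "features" column before and after.
    val sameType =
      df.schema("features").dataType == data.schema("features").dataType
    println(s"features type preserved: $sameType")

    sc.stop()
  }
}
```

If the two printed schemas both show "features" as a vector, the loaded DataFrame can be passed to the ml/mllib models without vectorizing again.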


Thanks
Yanbo

2016-02-27 2:01 GMT+08:00 Raj Kumar <raj.kumar@hooklogic.com>:

> Thanks for the response, Yanbo. Here is the source (it uses the
> sample_libsvm_data.txt file used in the
> mllib examples).
>
> -Raj
> ------------- IOTest.scala -------------
>
> import org.apache.spark.{SparkConf,SparkContext}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.DataFrame
>
> object IOTest {
>   val InputFile = "/tmp/sample_libsvm_data.txt"
>   val OutputDir ="/tmp/out"
>
>   val sconf = new SparkConf().setAppName("test").setMaster("local[*]")
>   val sqlc  = new SQLContext( new SparkContext( sconf ))
>   val df = sqlc.read.format("libsvm").load( InputFile  )
>   df.show; df.printSchema
>
>   df.write.format("json").mode("overwrite").save( OutputDir )
>   val data = sqlc.read.format("json").load( OutputDir )
>   data.show; data.printSchema
>
>   def main( args: Array[String]):Unit = {}
> }
>
>
> -----------------------
>
> On Feb 26, 2016, at 12:47 AM, Yanbo Liang <ybliang8@gmail.com> wrote:
>
> Hi Raj,
>
> Could you share your code so that others can help diagnose this issue?
> Which version did you use?
> I cannot reproduce this problem in my environment.
>
> Thanks
> Yanbo
>
> 2016-02-26 10:49 GMT+08:00 raj.kumar <raj.kumar@hooklogic.com>:
>
>> Hi,
>>
>> I am using mllib. I use the ml vectorization tools to create the
>> vectorized input dataframe for the ml/mllib machine-learning models,
>> with schema:
>>
>> root
>>  |-- label: double (nullable = true)
>>  |-- features: vector (nullable = true)
>>
>> To avoid repeated vectorization, I am trying to save and load this
>> dataframe using
>>    df.write.format("json").mode("overwrite").save( url )
>>    val data = Spark.sqlc.read.format("json").load( url )
>>
>> However when I load the dataframe, the newly loaded dataframe has the
>> following schema:
>> root
>>  |-- features: struct (nullable = true)
>>  |    |-- indices: array (nullable = true)
>>  |    |    |-- element: long (containsNull = true)
>>  |    |-- size: long (nullable = true)
>>  |    |-- type: long (nullable = true)
>>  |    |-- values: array (nullable = true)
>>  |    |    |-- element: double (containsNull = true)
>>  |-- label: double (nullable = true)
>>
>> which the machine-learning models do not recognize.
>>
>> Is there a way I can save and load this dataframe without the schema
>> changing?
>> I assume it has to do with the fact that Vector is not a basic type.
>>
>> thanks
>> -Raj
>>
>
>
