spark-user mailing list archives

From Davies Liu <dav...@databricks.com>
Subject Re: Partition RDD by key to create DataFrames
Date Tue, 15 Mar 2016 18:08:47 GMT
I think you could create a DataFrame with the schema (mykey, value1,
value2), then partition it by mykey when saving it as Parquet.

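import org.apache.spark.sql.Row

// schema: a StructType with fields (mykey, value1, value2)
val r2 = rdd.map { case (k, (v1, v2)) => Row(k, v1, v2) }
val df = sqlContext.createDataFrame(r2, schema)
df.write.partitionBy("mykey").parquet(path)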
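
For completeness, here is a slightly fuller sketch of the same idea in Scala. It assumes the key and both values are strings (the question only states that the two values are strings), and the names PartitionByKeyExample, toDataFrame, writeByKey and basePath are made up for illustration, not part of any Spark API.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper object; the names here are illustrative only.
object PartitionByKeyExample {

  // Assumed schema: all three columns are strings. The key type is an
  // assumption; adjust it to match the real data.
  val schema: StructType = StructType(Seq(
    StructField("mykey", StringType, nullable = false),
    StructField("value1", StringType, nullable = true),
    StructField("value2", StringType, nullable = true)
  ))

  // Flatten (key, (value1, value2)) pairs into three-column rows and
  // attach the schema above.
  def toDataFrame(sqlContext: SQLContext,
                  rdd: RDD[(String, (String, String))]): DataFrame = {
    val rows = rdd.map { case (k, (v1, v2)) => Row(k, v1, v2) }
    sqlContext.createDataFrame(rows, schema)
  }

  // Write Parquet files into one subdirectory per distinct key
  // under basePath.
  def writeByKey(df: DataFrame, basePath: String): Unit =
    df.write.partitionBy("mykey").parquet(basePath)
}
```

With this layout, the rows for a key k end up under basePath/mykey=k/. Reading that subdirectory directly, for example with sqlContext.read.parquet(basePath + "/mykey=k"), should give back only the value1 and value2 columns, which is close to the per-key table asked about in the question.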


On Tue, Mar 15, 2016 at 10:33 AM, Mohamed Nadjib MAMI
<mami@iai.uni-bonn.de> wrote:
> Hi,
>
> I have a pair RDD of the form: (mykey, (value1, value2))
>
> How can I create a DataFrame having the schema [V1 String, V2 String] to
> store [value1, value2] and save it into a Parquet table named "mykey"?
>
> createDataFrame() method takes an RDD and a schema (StructType) in
> parameters. The schema is known up front ([V1 String, V2 String]), but
> getting an RDD by partitioning the original RDD based on the key is what I
> can't get my head around so far.
>
> Similar questions have been around (like
> http://stackoverflow.com/questions/25046199/apache-spark-splitting-pair-rdd-into-multiple-rdds-by-key-to-save-values)
> but they do not use DataFrames.
>
> Thanks in advance!
>
