spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From naresh Goud <nareshgoud.du...@gmail.com>
Subject Re: partitionBy with partitioned column in output?
Date Mon, 26 Feb 2018 22:47:08 GMT
is this helps?

sc.parallelize(List((1,10),(2,
20))).toDF("foo","bar").map(("foo","bar")=>("foo",("foo","bar"))).
partitionBy("foo").json("json-out")


On Mon, Feb 26, 2018 at 4:28 PM, Alex Nastetsky <alex.nastetsky@verve.com>
wrote:

> Is there a way to make outputs created with "partitionBy" to contain the
> partitioned column? When reading the output with Spark or Hive or similar,
> it's less of an issue because those tools know how to perform partition
> discovery. But if I were to load the output into an external data warehouse
> or database, it would have no idea.
>
> Example below -- a dataframe with two columns "foo" and "bar" is
> partitioned by "foo", but the data only contains "bar", since it expects
> the reader to know how to derive the value of "foo" from the parent
> directory. Note that it's the same thing with Parquet and Avro as well, I
> just chose to use JSON in my example.
>
> scala> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").write.
> partitionBy("foo").json("json-out")
>
>
> $ ls json-out/
> foo=1  foo=2  _SUCCESS
> $ cat json-out/foo=1/part-00003-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
> {"bar":10}
> $ cat json-out/foo=2/part-00007-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
> {"bar":20}
>
> Thanks,
> Alex.
>

Mime
View raw message