spark-user mailing list archives

From Alex Nastetsky <alex.nastet...@verve.com>
Subject Re: partitionBy with partitioned column in output?
Date Mon, 26 Feb 2018 22:56:13 GMT
Yeah, was just discussing this with a co-worker and came to the same
conclusion -- need to essentially create a copy of the partition column.
Thanks.

Hacky, but it works. Seems counter-intuitive that Spark would remove the
column from the output... should at least give you an option to keep it.
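
For anyone finding this thread later, here is a minimal sketch of the
copy-the-column workaround, assuming Spark 2.x run in local mode. The
column name "foo_part" and the object and method names are made up for
illustration; only withColumn, partitionBy and json are Spark's own calls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object KeepPartitionColumn {
  // Add a duplicate of "foo" under the hypothetical name "foo_part".
  // Spark drops the partition column from the data files, so we
  // partition by the copy and the original "foo" stays in each record.
  def withPartitionCopy(df: DataFrame): DataFrame =
    df.withColumn("foo_part", col("foo"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("keep-partition-column")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 10), (2, 20)).toDF("foo", "bar")

    withPartitionCopy(df).write
      .mode("overwrite")
      .partitionBy("foo_part") // directories are named foo_part=<value>
      .json("json-out")

    spark.stop()
  }
}
```

The data files should then carry both "foo" and "bar", at the cost of an
extra "foo_part" column showing up for readers that do partition discovery.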

On Mon, Feb 26, 2018 at 5:47 PM, naresh Goud <nareshgoud.dulam@gmail.com>
wrote:

> Does this help?
>
> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").
>   withColumn("foo_copy", $"foo").write.partitionBy("foo_copy").json("json-out")
>
>
> On Mon, Feb 26, 2018 at 4:28 PM, Alex Nastetsky <alex.nastetsky@verve.com>
> wrote:
>
>> Is there a way to make outputs created with "partitionBy" contain the
>> partitioned column? When reading the output with Spark or Hive or similar,
>> it's less of an issue because those tools know how to perform partition
>> discovery. But if I were to load the output into an external data warehouse
>> or database, it would have no idea.
>>
>> Example below -- a dataframe with two columns "foo" and "bar" is
>> partitioned by "foo", but the data only contains "bar", since it expects
>> the reader to know how to derive the value of "foo" from the parent
>> directory. Note that it's the same thing with Parquet and Avro as well, I
>> just chose to use JSON in my example.
>>
>> scala> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").write.partitionBy("foo").json("json-out")
>>
>>
>> $ ls json-out/
>> foo=1  foo=2  _SUCCESS
>> $ cat json-out/foo=1/part-00003-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
>> {"bar":10}
>> $ cat json-out/foo=2/part-00007-18ca93d0-c3b1-424b-8ad5-291d8a29523b.json
>> {"bar":20}
>>
>> Thanks,
>> Alex.
>>
>
>
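
The partition discovery mentioned in the original question can be seen by
reading the directory back with Spark. This is only a sketch, assuming
Spark 2.x in local mode; the object and method names are made up for
illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadPartitionedOutput {
  // Write the example data partitioned by "foo", then read it back.
  // The files hold only "bar"; Spark rebuilds "foo" from the
  // foo=<value> directory names when it reads the parent directory.
  def roundTrip(spark: SparkSession, path: String): DataFrame = {
    import spark.implicits._
    Seq((1, 10), (2, 20)).toDF("foo", "bar")
      .write.mode("overwrite").partitionBy("foo").json(path)
    spark.read.json(path)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-partitioned-output")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    val df = roundTrip(spark, "json-out")
    df.printSchema() // schema lists both "bar" and "foo"
    df.show()

    spark.stop()
  }
}
```

A tool that reads the individual files without this directory-name logic
sees only "bar", which is the problem described above.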
