spark-user mailing list archives

From Samy Dindane <s...@dindane.com>
Subject Re: DataFrame API: how to partition by a "virtual" column, or by a nested column?
Date Thu, 13 Oct 2016 09:40:26 GMT
This partially answers the question: http://stackoverflow.com/a/35449563/604041

On 10/04/2016 03:10 PM, Samy Dindane wrote:
> Hi,
>
> I have the following schema:
>
> -root
>  |-timestamp
>  |-date
>    |-year
>    |-month
>    |-day
>  |-some_column
>  |-some_other_column
>
> I'd like to achieve either of these:
>
> 1) Use the timestamp field to partition by year, month and day.
> This looks weird though, as Spark wouldn't magically know how to load the data back since
> the year, month and day columns don't exist in the schema.
>
> 2) If 1) is not possible, partition data by date.year, date.month and date.day.
> `df.write.partitionBy('date.year')` does not work, since the `date.year` column does
> not exist in the schema.
>
> If 2) isn't possible either, I'll just move year, month and day to the root of the schema,
> which I don't like as it bloats it.
>
> Do you know if any of these is possible?
>
> Thank you,
>
> Samy
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>


