spark-user mailing list archives

From Samy Dindane <s...@dindane.com>
Subject DataFrame API: how to partition by a "virtual" column, or by a nested column?
Date Tue, 04 Oct 2016 13:10:53 GMT
Hi,

I have the following schema:

-root
  |-timestamp
  |-date
    |-year
    |-month
    |-day
  |-some_column
  |-some_other_column

I'd like to achieve either of these:

1) Use the timestamp field to partition by year, month and day.
This seems odd, though: Spark would have no obvious way to load the data back, since
the year, month and day columns don't exist in the schema.

2) If 1) is not possible, partition the data by date.year, date.month and date.day.
`df.write.partitionBy('date.year')` does not work, since `date.year` is not found as a
column in the schema (it is nested under `date`, not at the root).

If 2) isn't possible either, I'll just move year, month and day to the root of the schema,
which I'd rather avoid, as it bloats the schema.
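
To make the concern in 1) concrete, here is a minimal sketch in plain Python (not Spark) of the Hive-style directory layout that partitioned writes use: the partition values live in the directory names (`key=value`), so they can be read back from the path even though they are not stored in the data files. The helper names `partition_path` and `parse_partition_path` are hypothetical and exist only for this illustration. In PySpark, the derived columns would presumably come from `pyspark.sql.functions.year`, `month` and `dayofmonth` applied to `timestamp` before calling `partitionBy`.

```python
from datetime import datetime, timezone


def partition_path(ts: datetime) -> str:
    """Build a Hive-style partition directory (key=value segments) from a timestamp."""
    return f"year={ts.year}/month={ts.month}/day={ts.day}"


def parse_partition_path(path: str) -> dict:
    """Recover partition values from the key=value directory names in a path."""
    parts = {}
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # skip segments that are not key=value, e.g. the data file name
            parts[key] = int(value)
    return parts


# Example timestamp, chosen only for illustration.
ts = datetime(2016, 10, 4, 13, 10, 53, tzinfo=timezone.utc)
path = partition_path(ts)
print(path)  # year=2016/month=10/day=4
print(parse_partition_path(path + "/part-00000.parquet"))
```

The point of the sketch is only that year, month and day need not be stored alongside `timestamp` in the files themselves; whether Spark can derive them implicitly at write time is exactly the open question.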

Do you know if any of these is possible?

Thank you,

Samy


