spark-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject replace some partitions when writing dataframe
Date Thu, 17 Nov 2016 19:09:33 GMT
i am looking into writing a dataframe to parquet using partitioning, so
something like:

df
  .write
  .mode(saveMode)
  .partitionBy(partitionColumn)
  .format("parquet")
  .save(path)

i imagine i will have thousands of partitions. generally my goal is not to
recreate all partitions every time, but only a few of them. for the
partitions i do write to, i want to replace all the data.

i would expect this to be a general and typical use case, since a true
append (adding data to existing partitions) is messy and not idempotent and
should be avoided by design (in fact i am not sure why it exists at all,
unless transactions are supported). redoing all partitions is very
inefficient.

what saveMode do i use? in my tests, if i use saveMode=Overwrite then i lose
all existing partitions. saveMode=Append is the dangerous non-idempotent
usage that adds to partitions. i don't think saveMode=Ignore or
saveMode=ErrorIfExists will help me either.
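the only workaround i can think of so far (just a sketch, assuming the
standard hive-style directory layout of path/<partitionColumn>=<value>, and
assuming the partition values are safe to embed in a path) is to filter the
dataframe per partition value and overwrite each partition directory
directly, instead of letting partitionBy manage the layout:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

// sketch: replace only the partitions that are present in df, one
// directory at a time. df, path and partitionColumn are assumed to be
// defined as in the snippet above.
def overwritePartitions(df: DataFrame, path: String, partitionColumn: String): Unit = {
  // collect the distinct partition values present in the new data
  val partitionValues = df.select(partitionColumn).distinct().collect().map(_.get(0))

  partitionValues.foreach { value =>
    df.filter(col(partitionColumn) === value)
      .drop(partitionColumn)      // the value is already encoded in the directory name
      .write
      .mode(SaveMode.Overwrite)   // overwrites only this one partition directory
      .format("parquet")
      .save(s"$path/$partitionColumn=$value")
  }
}
```

this is idempotent per partition, but it loses the convenience of a single
partitionBy write, and readers would need to discover the partitions from
the directory layout (e.g. by reading the base path).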
