spark-user mailing list archives

From Peter Halliday <pjh...@cornell.edu>
Subject SaveMode, parquet and S3
Date Tue, 01 Mar 2016 03:34:45 GMT
I have a system where I’m saving parquet files to S3 via Spark.  They are partitioned in two
ways: first by date and then by a partition key.  There are multiple parquet files per
combination over a long period of time.  So the structure is like this:

s3://bucketname/date=2016-02-29/partitionkey=2342/filename.parquet.gz
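
For reference, this is roughly how we produce that layout (a sketch with illustrative bucket,
input, and column names, not our production code):

import org.apache.spark.sql.SaveMode

// sqlContext is the usual SQLContext; the input source here is hypothetical.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
val df = sqlContext.read.json("s3://bucketname/input/")

df.write
  .mode(SaveMode.ErrorIfExists)
  .partitionBy("date", "partitionkey")  // yields date=.../partitionkey=... directories
  .parquet("s3://bucketname/")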

There’s been disagreement on how the SaveMode should be used when saving out the data.
If we keep the SaveMode as ErrorIfExists, does that mean additional partitions or parquet
files written out later under the same parts of the subpath won’t be written successfully?
Also, does the SaveMode apply to tasks too?  Say we are using the Direct Output Committer,
and a failure in a task causes some of its files to be written and others not.  Would the
individual files automatically inherit the SaveMode, or does the SaveMode apply only to
the output as a whole?
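
For concreteness, the two behaviors we’re weighing look like this (again a sketch; df and
the path are the illustrative ones from above):

import org.apache.spark.sql.SaveMode

// Fails the whole write if the target path already exists:
df.write.mode(SaveMode.ErrorIfExists).partitionBy("date", "partitionkey").parquet("s3://bucketname/")

// Adds new files alongside whatever is already under the path:
df.write.mode(SaveMode.Append).partitionBy("date", "partitionkey").parquet("s3://bucketname/")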

Peter Halliday

