spark-dev mailing list archives

From: Steve Loughran
Subject: Re: Output Committers for S3
Date: Tue, 21 Feb 2017 13:52:01 GMT

On 20 Feb 2017, at 18:14, Matthew Schauer wrote:

I'm using Spark 1.5.2 and trying to append a data frame to a partitioned
Parquet directory in S3.  It is known that the default
`ParquetOutputCommitter` performs poorly in S3 because move is implemented
as copy/delete, but the `DirectParquetOutputCommitter` is not safe to use
for append operations in case of failure.  I'm not very familiar with the
intricacies of job/task committing/aborting, but I've written a rough
replacement output committer that seems to work.  It writes the results
directly to their final locations and uses the write UUID to determine which
files to remove in the case of a job/task abort.  It seems to be a workable
concept in the simple tests that I've tried.  However, I can't make Spark
use this alternate output committer because the changes in SPARK-8578
categorically prohibit any custom output committer from being used, even if
it's safe for appending.  I have two questions: 1) Does anyone more familiar
with output committing have any feedback on my proposed "safe" append
strategy, and 2) is there any way to circumvent the restriction on append
committers without editing and recompiling Spark?  Discussion of solutions
in Spark 2.1 is also welcome.
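The cleanup-by-UUID idea described above can be sketched in isolation. This is a minimal, hypothetical illustration (class and method names are mine, not Matthew's actual committer, and it uses the local filesystem rather than S3): part files carry the write's UUID in their names, so an abort can delete exactly the files this write produced while leaving previously appended data intact.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class DirectCommitSketch {

    // Name a part file so it can be traced back to this write's UUID.
    static String partFileName(String uuid, int partition) {
        return String.format("part-%05d-%s.parquet", partition, uuid);
    }

    // Job/task abort: remove only the files produced under this UUID.
    // Files from earlier appends have different UUIDs and are untouched.
    static void abortWrite(Path dest, String uuid) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dest)) {
            for (Path p : entries) {
                if (p.getFileName().toString().contains(uuid)) {
                    Files.delete(p);
                }
            }
        }
    }
}
```

Commit is then a no-op, since the files are already at their final paths; the open question, as the rest of the thread discusses, is whether partial task output can ever be observed by readers before an abort completes.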

Matthew, as part of the S3Guard committer work I'm doing in the Hadoop codebase (which requires
a consistent object store, implemented either natively or via a DynamoDB table), I'm modifying
FileOutputFormat to take alternate committers underneath.
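To make the "alternate committers underneath" idea concrete, here is a hypothetical sketch (the class and method names are mine, not from the Hadoop patch): the output format consults a registry of committer factories keyed by the destination's filesystem scheme, falling back to the classic rename-based committer when no store-specific one is registered.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class CommitterRegistry {

    // Stand-in for the real committer interface; only a name is modeled.
    interface Committer {
        String name();
    }

    private final Map<String, Function<URI, Committer>> factories = new HashMap<>();
    private final Function<URI, Committer> fallback;

    CommitterRegistry(Function<URI, Committer> fallback) {
        this.fallback = fallback;
    }

    // Register a committer factory for one filesystem scheme, e.g. "s3a".
    void register(String scheme, Function<URI, Committer> factory) {
        factories.put(scheme, factory);
    }

    // Choose a committer for a job's destination: scheme-specific if one
    // is registered, otherwise the default rename-based committer.
    Committer committerFor(URI dest) {
        return factories.getOrDefault(dest.getScheme(), fallback).apply(dest);
    }
}
```

The point of the indirection is that HDFS destinations keep their rename-based commit semantics, while an s3a destination can be routed to a committer that avoids the copy/delete rename entirely.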



Modified FOF:

Current status: getting the low-level tests at the MR layer working. The Spark committer exists
to the point of compiling, but is not yet tested. If you do want to get involved, the JIRA is:

