spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steve Loughran (JIRA)" <>
Subject [jira] [Created] (SPARK-23977) Add committer binding to Hadoop 3.1 PathOutputCommitter Mechanism
Date Fri, 13 Apr 2018 12:40:00 GMT
Steve Loughran created SPARK-23977:

             Summary: Add committer binding to Hadoop 3.1 PathOutputCommitter Mechanism
                 Key: SPARK-23977
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Steve Loughran

Hadoop 3.1 adds a mechanism for job-specific and store-specific committers (MAPREDUCE-6823,
MAPREDUCE-6956), and one key implementation, S3A committers, HADOOP-13786

These committers deliver high-performance output of MR and spark jobs to S3, and offer the
key semantics which Spark depends on: no visible output until job commit, a failure of a task
at an stage, including partway through task commit, can be handled by executing and committing
another task attempt. 

In contrast, the FileOutputFormat commit algorithms on S3 have issues:

* Awful performance because files are copied by rename
* FileOutputFormat v1: weak task commit failure recovery semantics as the (v1) expectation:
"directory renames are atomic" doesn't hold.
* S3 metadata eventual consistency can cause rename to miss files or fail entirely (SPARK-15849)

Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of the commit semantics
w.r.t observability of or recovery from task commit failure, on any filesystem.

The S3A committers address these by way of uploading all data to the destination through multipart
uploads, uploads which are only completed in job commit.

The new {{PathOutputCommitter}} factory mechanism allows applications to work with the S3A
committers and any other, by adding a plugin mechanism into the MRv2 FileOutputFormat class,
where it job config and filesystem configuration options can dynamically choose the output

Spark can use these with some binding classes to 

# Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 classes and {{PathOutputCommitterFactory}}
to create the committers.
# Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
to wire up Parquet output even when code requires the committer to be a subclass of {{ParquetOutputCommitter}}

This patch builds on SPARK-23807 for setting up the dependencies.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message