spark-dev mailing list archives

From Steve Loughran <>
Subject Re: saveAsNewAPIHadoopDataset must not enable speculation for parquet file?
Date Tue, 03 Apr 2018 20:44:59 GMT

> On 3 Apr 2018, at 11:19, cane <> wrote:
> Now, if we use saveAsNewAPIHadoopDataset with speculation enabled, it may
> cause data loss.
> I checked the comment of this API:
>
>   We should make sure our tasks are idempotent when speculation is enabled,
>   i.e. do not use an output committer that writes data directly. There is
>   an example in [link elided] to show the bad result of using a direct
>   output committer with speculation enabled.
>
> But is this the rule we must follow?
> For example, Parquet will get ParquetOutputCommitter.
> In this case, must speculation be disabled for Parquet?
> Does anyone know the history?
> Thanks very much!

If you are writing to HDFS, or to object stores other than S3, and you make sure that you
are using the FileOutputFormat commit algorithm, you can use speculation without problems.

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 1
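
As a minimal sketch of applying that (the app name and output path here are
illustrative, not from this thread), the setting can be passed when the
session is built:

import org.apache.spark.sql.SparkSession

// Pin the committer to the v1 algorithm: task output is only promoted
// through the coordinated task/job commit, so at most one speculative
// attempt's output ever becomes visible.
val spark = SparkSession.builder()
  .appName("speculation-safe-write")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .config("spark.speculation", "true")
  .getOrCreate()

// Writes through FileOutputFormat-based committers (Parquet included)
// are now safe under speculative execution.
spark.range(1000).write.parquet("hdfs:///tmp/speculation-demo")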

If you use the version 2 algorithm, then you are vulnerable to a failure during task
commit, but only during task commit, and only if speculative/repeated task attempts
generate output files with different names.

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
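
If you're unsure which algorithm a job will actually use (distributions
sometimes change the default), you can read it back from the Hadoop
configuration Spark hands to committers; a small sketch:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// The committer reads this key from the Hadoop configuration;
// 1 is the Apache Hadoop default when the key is unset.
val version = spark.sparkContext.hadoopConfiguration
  .getInt("mapreduce.fileoutputcommitter.algorithm.version", 1)
println(s"FileOutputCommitter algorithm version: $version")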

If you are using S3 as a direct destination of work then, in the absence of a consistency
layer (S3mper, EMR consistent view, Hadoop 3.x + S3Guard) or an S3-specific committer, you
are always at risk of data loss. Don't do that.
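
A sketch of the safer pattern, assuming you have HDFS (or another consistent
filesystem) available as a staging area; the bucket and paths are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// 1. Commit the work on a consistent filesystem first, where the
//    v1 commit algorithm's renames are safe.
spark.range(1000).toDF("id")
  .write.parquet("hdfs:///staging/example-output")

// 2. Copy the completed directory to S3 as a separate step, e.g.
//      hadoop distcp hdfs:///staging/example-output s3a://my-bucket/output
//    so no output committer ever runs against S3 directly.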

Further reading
