spark-user mailing list archives

From Igor Berman <igor.ber...@gmail.com>
Subject Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append
Date Sat, 05 Mar 2016 19:13:54 GMT
It's not safe to use the direct committer with append mode; you may lose your
data.
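
To illustrate the concern, here is a toy model in plain Scala. It is not Spark's committer code: the object name, the function names, and the idea of modelling a directory as a set of file names are all made up for this sketch. It assumes the usual explanation for the risk, namely that a direct committer makes each task's files visible in the destination immediately, so a job that fails halfway leaves partial output mixed in with the data that was already there, while a staged commit only publishes files once every task has succeeded.

```scala
// Toy model only: a "directory" is a set of file names, and a job is a
// sequence of task output files. None of this is Spark or Hadoop API.
object AppendCommitSketch {
  type Dir = Set[String]

  // Direct commit: every finished task writes straight into the destination.
  // failAfter = Some(n) means the job dies after n tasks have written.
  def directAppend(dest: Dir, taskFiles: Seq[String], failAfter: Option[Int]): Dir = {
    val written = failAfter.fold(taskFiles)(n => taskFiles.take(n))
    dest ++ written // partial output is already visible next to the old data
  }

  // Staged commit: tasks write to a temporary area, and files are moved
  // into the destination only if the whole job succeeded.
  def stagedAppend(dest: Dir, taskFiles: Seq[String], failAfter: Option[Int]): Dir = {
    val staged = failAfter.fold(taskFiles)(n => taskFiles.take(n))
    if (failAfter.isEmpty) dest ++ staged else dest // failed job: nothing published
  }

  def main(args: Array[String]): Unit = {
    val existing: Dir = Set("part-old-0", "part-old-1")
    val batch = Seq("part-new-0", "part-new-1", "part-new-2")

    // Same failure in both cases: the job dies after the first task.
    val afterDirect = directAppend(existing, batch, failAfter = Some(1))
    val afterStaged = stagedAppend(existing, batch, failAfter = Some(1))

    println("direct: " + afterDirect.toSeq.sorted.mkString(", "))
    println("staged: " + afterStaged.toSeq.sorted.mkString(", "))
  }
}
```

In the direct case the destination ends up holding the old files plus one file from the failed batch, and nothing in the directory says which files belong to the failed job, so a cleanup or a retry can easily drop or duplicate data. In the staged case the destination is unchanged after the failure.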

On 4 March 2016 at 22:59, Jelez Raditchkov <jelez@hotmail.com> wrote:

> Working on a streaming job with DirectParquetOutputCommitter to S3
> I need to use PartitionBy and hence SaveMode.Append
>
> Apparently when using SaveMode.Append spark automatically defaults to the
> default parquet output committer and ignores DirectParquetOutputCommitter.
>
> My problems are:
> 1. the copying to _temporary takes a lot of time
> 2. I get job failures with: java.io.FileNotFoundException: File
> s3n://jelez/parquet-data/_temporary/0/task_201603040904_0544_m_000007 does
> not exist.
>
> I have set:
>         sparkConfig.set("spark.speculation", "false")
>         sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
>         sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")
>
> Any ideas? Opinions? Best practices?
>
>
