spark-user mailing list archives

From Everett Anderson <ever...@nuna.com.INVALID>
Subject Re: S3A + EMR failure when writing Parquet?
Date Sun, 28 Aug 2016 23:19:15 GMT
(Sorry, typo -- I was using spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2, not 'hadooop', of course)

On Sun, Aug 28, 2016 at 12:51 PM, Everett Anderson <everett@nuna.com> wrote:

> Hi,
>
> I'm having some trouble figuring out a failure when writing a DataFrame
> as Parquet via S3A on EMR 4.7.2 (which is Hadoop 2.7.2 and Spark 1.6.2).
> It works when using EMRFS (s3://), though.
>
> I'm using these extra conf params, though I've also tried removing all of
> them except the encryption one, with the same result:
>
> --conf spark.hadooop.mapreduce.fileoutputcommitter.algorithm.version=2
> --conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true
> --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256
> --conf spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter
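>
> For reference, a rough sketch of the same settings applied programmatically
> in Scala (the SparkContext setup, output path, and DataFrame below are
> illustrative placeholders, not my exact job):
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.SQLContext
>
> // Same settings as the --conf flags above, set via SparkConf
> val conf = new SparkConf()
>   .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
>   .set("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped", "true")
>   .set("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "AES256")
>   .set("spark.sql.parquet.output.committer.class",
>        "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
>
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
>
> // df is assumed to be an existing DataFrame; the bucket is a placeholder
> // df.write.parquet("s3a://some-bucket/some-output-path/")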
>
> It looks like it does actually write the parquet shards under
>
> <output root S3>/_temporary/0/_temporary/<attempt>/
>
> but then must hit that S3 exception when trying to copy/rename. I think
> the NullPointerException deep down in Parquet is due to close() being
> called more than once, so it isn't the root cause, but I'm not sure.
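>
> A quick sketch of how one could list what's left under that temporary
> prefix from spark-shell (bucket and path are placeholders):
>
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> // List whatever the task attempts left behind under _temporary
> val outputRoot = "s3a://some-bucket/some-output-path"
> val fs = FileSystem.get(new java.net.URI(outputRoot), sc.hadoopConfiguration)
> fs.listStatus(new Path(outputRoot + "/_temporary/0/_temporary/"))
>   .foreach(status => println(status.getPath))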
>
> Anyone seen something like this?
>
> 16/08/28 19:46:28 ERROR InsertIntoHadoopFsRelation: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in stage 1.0 (TID 54, ip-10-8-38-103.us-west-2.compute.internal): org.apache.spark.SparkException: Task failed while writing rows
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:269)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:89)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Failed to commit task
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.org$apache$spark$sql$execution$datasources$DefaultWriterContainer$$commitTask$1(WriterContainer.scala:283)
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:265)
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:260)
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:260)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1277)
> 	at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:266)
> 	... 8 more
> 	Suppressed: java.lang.NullPointerException
> 		at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:147)
> 		at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
> 		at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
> 		at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
> 		at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.org$apache$spark$sql$execution$datasources$DefaultWriterContainer$$abortTask$1(WriterContainer.scala:290)
> 		at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$2.apply$mcV$sp(WriterContainer.scala:266)
> 		at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1286)
> 		... 9 more
> Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: EA0E434768316935), S3 Extended Request ID: fHtu7Q9VSi/8h0RAyfRiyK6uAJnajZBrwqZH3eBfF5kM13H6dDl006031NTwU/whyGu1uNqW1mI=
> 	at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1389)
> 	at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:902)
> 	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
> 	at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
> 	at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
> 	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
> 	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
> 	at com.amazonaws.services.s3.AmazonS3Client.putObject(AmazonS3Client.java:1405)
> 	at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadInOneChunk(UploadCallable.java:131)
> 	at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:123)
> 	at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:139)
> 	at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:47)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	... 3 more
>
>
>
