spark-issues mailing list archives

From "liupengcheng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-22903) AlreadyBeingCreatedException in stage retry caused by wrong attemptNumber
Date Tue, 02 Jan 2018 03:15:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307633#comment-16307633 ]

liupengcheng commented on SPARK-22903:
--------------------------------------

[~imranr] I agree that an incomplete cleanup, as you said, can cause the exception, but the
ResultTask's save path is a user-specified argument that gets serialized into the function
closure; I don't think we can, or should, dig it out of the closure to do the cleanup.
What's more, another case can also trigger the problem: currently Spark keeps the tasks of a
fetch-failed stage running rather than killing them, so when such a task has not yet finished
and the same task of the retried stage is submitted, the two can conflict and cause the
exception.
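The race between a still-running zombie attempt and its resubmitted counterpart can be illustrated with plain file-creation semantics. This is only a local-filesystem analogy (the class and file names below are made up for illustration), but `Files.createFile` refuses an existing path much like the HDFS namenode refuses a `create` on a path whose lease is held by another DFSClient:

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Analogy only: local createFile stands in for HDFS create(overwrite = false).
public class CreateRaceSketch {
    // Returns true if this "attempt" won the create;
    // false if another attempt already holds the path.
    static boolean tryCreate(Path p) throws IOException {
        try {
            Files.createFile(p);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("attempt-race");
        Path part = dir.resolve("part-r-00014.snappy.parquet");

        boolean zombie = tryCreate(part); // fetch-failed-stage task, still running
        boolean retry  = tryCreate(part); // same task resubmitted in the retried stage
        System.out.println("zombie=" + zombie + " retry=" + retry); // retry loses the race
    }
}
```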


> AlreadyBeingCreatedException in stage retry caused by wrong attemptNumber
> -------------------------------------------------------------------------
>
>                 Key: SPARK-22903
>                 URL: https://issues.apache.org/jira/browse/SPARK-22903
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.3.0
>         Environment: Spark2.1.0 + yarn
>            Reporter: liupengcheng
>              Labels: core
>
> We submitted a Spark 2.1.0 job; when a MetadataFetchFailedException occurred, the stage was
retried, but an AlreadyBeingCreatedException was thrown, which ultimately caused the job to
fail.
> {noformat}
> 2017-12-21,21:30:58,406 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 13.0
in stage 7.1 (TID 18990, <host>, executor 326): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
Failed to create file [/<outputpath>/_temporary/0/_temporary/attempt_201712211720_0026_r_000014_0/part-r-00014.snappy.parquet]
for [DFSClient_NONMAPREDUCE_-1477691024_103] for client [10.136.42.10], because this file
is already being created by [DFSClient_NONMAPREDUCE_940892524_103] on [10.118.21.26]
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2672)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2388)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2317)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2270)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:604)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:374)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1806)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1477)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1408)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy21.create(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:301)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy22.create(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1779)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1773)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1698)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:433)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:429)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:444)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:373)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:928)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:806)
>         at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:176)
>         at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:160)
>         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:289)
>         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1108)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>         at org.apache.spark.scheduler.Task.run(Task.scala:90)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:241)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
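For reference, the conflicting path in the trace above is attempt-scoped: the output committer derives it from the job timestamp, stage, partition, and attempt number. The sketch below is a hypothetical reconstruction (not Spark's actual internals) showing why a retried task that reuses the same attemptNumber resolves to the identical temporary file and therefore collides on create:

```java
// Hypothetical reconstruction of the attempt-scoped temporary output path
// seen in the stack trace above. If the retried task reuses attemptNumber 0,
// both attempts resolve to the same file and the second create() is rejected.
public class AttemptPathSketch {
    static String attemptPath(String jobTs, int stage, int partition, int attemptNumber) {
        String attemptId = String.format("attempt_%s_%04d_r_%06d_%d",
                jobTs, stage, partition, attemptNumber);
        return String.format("_temporary/0/_temporary/%s/part-r-%05d.snappy.parquet",
                attemptId, partition);
    }

    public static void main(String[] args) {
        String original = attemptPath("201712211720", 26, 14, 0);
        String retried  = attemptPath("201712211720", 26, 14, 0); // attemptNumber unchanged
        System.out.println(original);
        // identical paths -> second HDFS create() hits AlreadyBeingCreatedException
        System.out.println("collision=" + original.equals(retried));
    }
}
```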



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

