spark-issues mailing list archives

From "Marcelo Vanzin (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-19528) external shuffle service registration timeout is very short with heavy workloads when dynamic allocation is enabled
Date Thu, 07 Feb 2019 18:36:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Marcelo Vanzin resolved SPARK-19528.
------------------------------------
    Resolution: Duplicate

I believe this is the same issue as SPARK-24355. There's a new configuration you can use to
reserve some RPC resources in the shuffle service for non-shuffle requests, such as authentication.
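
For reference, the settings involved can be sketched as below. The property names are taken from SPARK-24355 (chunk-fetch handler thread percentage) and SPARK-20640 (registration timeout/retries); treat them as assumptions and verify them against the configuration documentation for your Spark version before relying on them.

{noformat}
# spark-defaults.conf (sketch, not verified against any particular release)

# Dynamic allocation requires the external shuffle service.
spark.dynamicAllocation.enabled                       true
spark.shuffle.service.enabled                         true

# SPARK-20640: allow executor registration with the shuffle service more
# time and more attempts under heavy load (defaults: 5000 ms, 3 attempts).
spark.shuffle.registration.timeout                    30000
spark.shuffle.registration.maxAttempts                5

# SPARK-24355: cap the share of shuffle-service server threads used for
# chunk-fetch requests, leaving headroom for other RPCs such as
# registration and authentication.
spark.shuffle.server.chunkFetchHandlerThreadsPercent  75
{noformat}

Note that the last property takes effect on the side where the shuffle server runs (e.g. the YARN NodeManager's shuffle service), not in the application.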

> external shuffle service registration timeout is very short with heavy workloads when dynamic allocation is enabled
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19528
>                 URL: https://issues.apache.org/jira/browse/SPARK-19528
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Shuffle, Spark Core
>    Affects Versions: 1.6.2, 1.6.3, 2.0.2
>         Environment: Hadoop2.7.1
> spark1.6.2
> hive2.2
>            Reporter: KaiXu
>            Priority: Major
>         Attachments: SPARK-19528.1.patch, SPARK-19528.1.spark2.patch
>
>
> When dynamic allocation is enabled, the external shuffle service is used to maintain shuffle state on behalf of executors, so executors must register with it at startup. The external shuffle service should therefore not close a connection while it still has outstanding requests from an executor.
> container's log:
> {noformat}
> 17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.1.1:41867
> 17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
> 17/02/09 08:30:46 INFO executor.Executor: Starting executor ID 75 on host hsx-node8
> 17/02/09 08:30:46 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40374.
> 17/02/09 08:30:46 INFO netty.NettyBlockTransferService: Server created on 40374
> 17/02/09 08:30:46 INFO storage.BlockManager: external shuffle service port = 7337
> 17/02/09 08:30:46 INFO storage.BlockManagerMaster: Trying to register BlockManager
> 17/02/09 08:30:46 INFO storage.BlockManagerMaster: Registered BlockManager
> 17/02/09 08:30:46 INFO storage.BlockManager: Registering executor with local external shuffle service.
> 17/02/09 08:30:51 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from hsx-node8/192.168.1.8:7337 is closed
> 17/02/09 08:30:51 ERROR storage.BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
> 	at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
> 	at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
> 	at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:144)
> 	at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
> 	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 	at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:215)
> 	at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:201)
> 	at org.apache.spark.executor.Executor.<init>(Executor.scala:86)
> 	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
> 	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
> 	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
> 	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> 	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
> 	at org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
> 	at org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
> 	at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:274)
> 	... 14 more
> 17/02/09 08:31:01 ERROR storage.BlockManager: Failed to connect to external shuffle server, will retry 1 more times after waiting 5 seconds...
> {noformat}
> nodemanager's log:
> {noformat}
> 2017-02-09 08:30:48,836 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1486564603520_0097_01_000005]
> 2017-02-09 08:31:12,122 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1486564603520_0096_01_000071 is : 1
> 2017-02-09 08:31:12,122 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1486564603520_0096_01_000071 and exit code: 1
> ExitCodeException exitCode=1:
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
>         at org.apache.hadoop.util.Shell.run(Shell.java:456)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1486564603520_0096_01_000071
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: ExitCodeException exitCode=1:
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell.run(Shell.java:456)
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.lang.Thread.run(Thread.java:745)
> 2017-02-09 08:31:12,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 1
> 2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1486564603520_0096_01_000071 transitioned from RUNNING to EXITED_WITH_FAILURE
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

