spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hough, Stephen C" <>
Subject [Spark RPC] Help how to debug sudden performance issue
Date Mon, 11 Mar 2019 06:40:19 GMT
Spark Version: 2.0.2

I am running a cluster with 400 workers where each worker has 1 executor configured with 40
cores for a total capacity of 16,000 cores.
I run about 10,000 jobs with 1.5M tasks where the job is a simple spark.parallelize(list,
list.size()).map(func).collectAsysnc().  The job contains a list of tasks ranging from 1 to
'func' will run our code which will do computations then post events over Kafka and return
a Boolean, we ignore the result.  The duration of this func can be from seconds up to 20 mins.

The application context is launched on a driver server with 32 cores and the only other services
running on the box is a Kafka broker and zookeeper service.

This particular batch in development environment took around 2 hours to run which met our
SLA however when we ran in production it took 3 hours to complete, we thought it may have
been due to another cluster we were running with around 300 workers however AWS informed us
that the networks were isolated.  I moved the job to run later after the other clusters batch
had finished and the time reduced back down to 2 hrs.
I analyzed our logs and it shows that a yet to be determined incident @22:02:42 caused Spark
to 'go slow'.

By capturing the duration from the executor thread message 'Finished task' I tracked the TID
seen by the task result getter to determine duration until the result is processed on the
driver and a core is freed for the scheduler.

For the most part it is within a reasonable range of 10ms then suddenly at the given incident
time it suddenly rises to 5s, 20s, 587s, peaking at 32m only 8 mins after the incident.  So
it takes 32 mins from the time the result was sent back to spark driver to the time it is
processed which explains the performance hit because during this time the freed cores on the
worker go idle waiting for a new task.  Note I did track the time I saw our Kafka event sent
by this task and we saw it roughly 2ms later on the driver so the results are getting to server
over the network okay.

Looking at the rpc code it became apparent to me that if we start to see a build of messages
the dispatcher should turn single-threaded as it processes the backlog, so I did another scan
of the driver logs to look for long running dispatcher threads, i.e. a dispatcher that processes
more than 1 consecutive message.  A very obvious issue became apparent.

Dispatcher: 23 started 22:02:42:647 processed 80386 consecutive messages for a duration of
53 minutes.

If one looks at the beginning of these messages it is obvious that a slowdown occurs, the
first 3 are within millis of each other, then a suspicious 100ms delay starts happening.

04-03-19 22:02:43:032 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint
- Launching task 1418419 on executor id: 1048 hostname:
04-03-19 22:02:43:034 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint
- Launching task 1418420 on executor id: 967 hostname:
04-03-19 22:02:43:037 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint
- Launching task 1418421 on executor id: 791 hostname:
04-03-19 22:02:43:136 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint
- Launching task 1418422 on executor id: 941 hostname:
04-03-19 22:02:43:234 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint
- Launching task 1418423 on executor id: 1085 hostname:
04-03-19 22:02:43:348 [INFO] [dispatcher-event-loop-23] o.a.s.s.c.CoarseGrainedSchedulerBackend$DriverEndpoint
- Launching task 1418424 on executor id: 944 hostname:

Unfortunately I can't turn on any more extra logging for the DriverEndpoint 'StatusUpdate'
handler however at a guess I would say the launchTasks, executorData.executorEndpoint.send
operation is introducing some sort of blocking which causes a backlog that takes time to process.

When the system is running okay we don't see this behaviour.

Q, Have you guys seen this behaviour before, and if so would an update to Spark 2.4 do the

If not are there any extra logging or debugging I can do to track down what the external event
may be that is introducing the delay.  Given the volume of tasks I can only really analyze

View raw message