spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ashish Rawat <>
Subject FW: High Availability of Spark Driver
Date Thu, 27 Aug 2015 07:42:02 GMT
Hi Patrick,

As discussed in another thread, we are looking for a solution to the problem of lost state
on Spark Driver failure. Can you please share Spark's long term strategy for resolving this

<-- Original Mail Content Below -->

We have come across the problem of Spark Applications (on Yarn) requiring a restart in case
of Spark Driver (or application master) going down. This is hugely inconvenient for long running
applications which are maintaining a big state in memory. The repopulation of state in itself
may require a downtime of many minutes, which is not acceptable for most live systems.

As you would have noticed that Yarn community has acknowledged "long running services" as
an important class of use cases, and thus identified and removed problems in working with
long running services in Yarn.

It would be great if Spark, which is the most important processing engine on Yarn, also figures
out issues in working with long running Spark applications and publishes recommendations or
make framework changes for removing those. The need to keep the application running in case
of Driver and Application Master failure, seems to be an important requirement from this perspective.
The two most compelling use cases being:

  1.  Huge state of historical data in Spark Streaming, required for stream processing
  2.  Very large cached tables in Spark SQL (very close to our use case where we periodically
cache RDDs and query using Spark SQL)

In our analysis, for both of these use cases, a working HA solution can be built by

  1.  Preserving the state of executors (not killing them on driver failures)
  2.  Persisting some meta info required by Spark SQL and Block Manager.
  3.  Restarting and reconnecting the Driver to AM and executors

This would preserve the cached data, enabling the application to come back quickly. This can
further be extended for preserving and recovering the Computation State.

I would request you to share your thoughts on this issue and possible future directions.


View raw message