spark-dev mailing list archives

From Niranda Perera <niranda.per...@gmail.com>
Subject Re: Possible deadlock in registering applications in the recovery mode
Date Mon, 18 Apr 2016 01:40:07 GMT
Hi guys,

Any update on this?
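
To make the proposal in my earlier mail (quoted below) more concrete, here is a rough sketch of the idea as a toy model. It is not Spark's actual Master code: `ToyMaster`, `App`, `pendingRegistrations` and `completeRecovery` are hypothetical names I made up for illustration, and only `addressToApp`, `waitingApps` and the log message mirror the real code. It assumes registrations that arrive during recovery can simply be buffered and replayed once the dead apps have been removed.

```scala
import scala.collection.mutable

// Toy model only: these are NOT Spark's Master / ApplicationInfo classes.
final case class App(id: String, driverAddress: String)

class ToyMaster(recoveredApps: Seq[App]) {
  // True while the master is still in its recovery phase.
  private var recovering = true
  // Apps restored from the persistence engine, keyed by driver address.
  private val addressToApp = mutable.HashMap.empty[String, App]
  // Hypothetical buffer for registrations that arrive during recovery.
  private val pendingRegistrations = mutable.ArrayBuffer.empty[App]
  val waitingApps = mutable.ArrayBuffer.empty[App]

  for (a <- recoveredApps) addressToApp(a.driverAddress) = a

  def registerApplication(app: App): Unit = {
    if (recovering) {
      // Defer the decision instead of rejecting on an address clash.
      pendingRegistrations += app
    } else if (addressToApp.contains(app.driverAddress)) {
      println(s"Attempted to re-register application at same address: ${app.driverAddress}")
    } else {
      addressToApp(app.driverAddress) = app
      waitingApps += app
    }
  }

  // Ends recovery: drops recovered apps that never responded,
  // then replays the registrations that were buffered meanwhile.
  def completeRecovery(respondedIds: Set[String]): Unit = {
    val dead = addressToApp.values.filterNot(a => respondedIds.contains(a.id)).toList
    dead.foreach(a => addressToApp.remove(a.driverAddress))
    recovering = false
    val buffered = pendingRegistrations.toList
    pendingRegistrations.clear()
    buffered.foreach(registerApplication)
  }
}
```

With this, if app B registers from app A's old address while the master is recovering, it is held back rather than rejected, and it only reaches `waitingApps` after app A has been dropped as unresponsive. If app A does respond during recovery, B is still rejected as before.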

Best

On Tue, Apr 12, 2016 at 12:46 PM, Niranda Perera <niranda.perera@gmail.com>
wrote:

> Hi all,
>
> I have encountered a small issue in the standalone recovery mode.
>
> Let's say an application A is running in the cluster. Due to some
> issue, the entire cluster, together with application A, goes down.
>
> Later, the cluster comes back online and the master goes into the
> 'recovering' mode, because the Persistence Engine shows that some apps,
> workers and drivers were already in the cluster. While the recovery is
> in progress, the application comes back online, but now it has a
> different ID, let's say B.
>
> However, per the master's application registration logic, application B
> will NOT be added to 'waitingApps'; the master only logs the message
> "Attempted to re-register application at same address". [1]
>
>   private def registerApplication(app: ApplicationInfo): Unit = {
>     val appAddress = app.driver.address
>     if (addressToApp.contains(appAddress)) {
>       logInfo("Attempted to re-register application at same address: " + appAddress)
>       return
>     }
>     // ... (rest of the method omitted)
>
>
> The problem is that the master is trying to recover application A, which
> no longer exists. Therefore, after the recovery process, app A is
> dropped. However, app A's successor, app B, was also left out of the
> 'waitingApps' list because it has the same address that app A had.
>
> This creates a deadlock in the cluster: neither app A nor app B is
> available in the cluster.
>
> When the master is in the RECOVERING mode, shouldn't it first add all
> registering apps to a list and then, after the recovery is completed
> (once the unsuccessful recoveries are removed), deploy the apps that are
> new?
>
> IMO, this would resolve the deadlock.
>
> Looking forward to hearing from you.
>
> Best
>
> [1]
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
>
> --
> Niranda
> @n1r44 <https://twitter.com/N1R44>
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>



