spark-issues mailing list archives

From "sooyeon shin (Jira)" <>
Subject [jira] [Commented] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade
Date Thu, 23 Jul 2020 08:56:00 GMT


sooyeon shin commented on SPARK-27780:

Hi there. I had the same issue.

In Spark 3.0, dynamic allocation can run without the external shuffle service by enabling shuffle tracking:
--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.shuffleTracking.enabled=true
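Per the Spark 3.0 dynamic allocation documentation, dynamic allocation needs either the external shuffle service (spark.shuffle.service.enabled=true) or shuffle tracking. The Python sketch below restates that rule in simplified form. The helpers `parse_conf_flags` and `dynamic_allocation_ok` are hypothetical and not part of Spark; Spark performs its own validation internally.

```python
def parse_conf_flags(args):
    """Hypothetical helper: collect spark-submit style
    ["--conf", "key=value", ...] pairs into a dict."""
    conf = {}
    it = iter(args)
    for arg in it:
        if arg == "--conf":
            # Split on the first "=" only, so values may contain "=".
            key, _, value = next(it, "").partition("=")
            if key:
                conf[key] = value
    return conf


def dynamic_allocation_ok(conf):
    """Hypothetical, simplified restatement of the Spark 3.0 requirement."""
    def enabled(name):
        return conf.get(name, "false").strip().lower() == "true"

    if not enabled("spark.dynamicAllocation.enabled"):
        return True  # dynamic allocation is off, so there is nothing to check
    # Executors can only be released safely if their shuffle files are served
    # by the external service, or if shuffle tracking keeps executors that
    # hold shuffle data for active jobs alive.
    return (enabled("spark.shuffle.service.enabled")
            or enabled("spark.dynamicAllocation.shuffleTracking.enabled"))


flags = [
    "--conf", "spark.dynamicAllocation.enabled=true",
    "--conf", "spark.dynamicAllocation.shuffleTracking.enabled=true",
]
print(dynamic_allocation_ok(parse_conf_flags(flags)))  # True
```

With the two flags above the check passes; dropping the shuffle tracking flag without enabling spark.shuffle.service.enabled would make it fail.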


I hope it helps.

> Shuffle server & client should be versioned to enable smoother upgrade
> ----------------------------------------------------------------------
>                 Key: SPARK-27780
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.1.0
>            Reporter: Imran Rashid
>            Priority: Major
> The external shuffle service is often upgraded at a different time than Spark itself.
> However, when the protocol between the shuffle service and the Spark runtime changes,
> users are forced to upgrade everything simultaneously.
> We should add versioning to the shuffle client and server, so that each side knows which
> messages the other supports. This would allow better handling of mixed versions, from
> clearer error messages to supporting some mismatched versions (with reduced capabilities).
> This originally came up in a discussion here:
> There are a few ways we could do the versioning, which we still need to discuss:
> 1) Version specified by config. This allows mixed versions across the cluster and
> rolling upgrades. It would also let a Spark 3.0 client talk to a 2.4 shuffle service.
> However, getting the configuration right may be a nuisance for users.
> 2) Auto-detection during registration with the local shuffle service. This makes
> versioning easy for the end user, and can even handle a 2.4 shuffle service, although
> that service does not support the new versioning. However, it will not handle a rolling
> upgrade correctly: if the local shuffle service has been upgraded but other nodes in the
> cluster have not, it will get the version wrong.
> 3) Exchange versions per connection. When a connection is opened, the server and client
> could first exchange messages containing their versions, so they know how to continue
> communicating after that.
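To make option 3 concrete (and how option 1's config could cap it), here is a minimal Python sketch of per-connection version negotiation with reduced capabilities, together with the rolling-upgrade pitfall described for option 2. Everything in it is hypothetical: the `Endpoint` type, the `negotiate` and `can_send` helpers, the version numbers, and the message names are illustrative and are not Spark's actual shuffle protocol.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical protocol table: the message types each protocol version
# understands. Version numbers and message names are illustrative only.
MESSAGES_BY_VERSION = {
    1: frozenset({"register_executor", "open_blocks"}),
    2: frozenset({"register_executor", "open_blocks", "fetch_shuffle_blocks"}),
}


@dataclass(frozen=True)
class Endpoint:
    name: str
    version: int  # highest protocol version this side speaks


def negotiate(client: Endpoint, server: Endpoint,
              max_version: Optional[int] = None) -> int:
    """Option 3: on connect, both sides exchange versions and agree on the
    highest one they both speak. `max_version` stands in for option 1's
    config knob, letting an operator cap the version explicitly."""
    agreed = min(client.version, server.version)
    if max_version is not None:
        agreed = min(agreed, max_version)
    if agreed not in MESSAGES_BY_VERSION:
        # e.g. an endpoint reported a version this table does not know about
        raise ValueError(
            f"unknown protocol version {agreed} for {client.name} <-> {server.name}")
    return agreed


def can_send(message: str, version: int) -> bool:
    """Reduced capabilities: only send messages the agreed version supports."""
    return message in MESSAGES_BY_VERSION[version]


client = Endpoint("client", version=2)
local_service = Endpoint("local shuffle service (upgraded)", version=2)
remote_service = Endpoint("remote shuffle service (not yet upgraded)", version=1)

# Option 2 pitfall: a version detected once, against the local service,
# is then wrongly assumed for every remote node.
detected_once = negotiate(client, local_service)
print(detected_once, can_send("fetch_shuffle_blocks", detected_once))  # 2 True

# Option 3: negotiating per connection picks the right version for each peer,
# so the client falls back to the v1 message set for the old remote service.
per_conn = negotiate(client, remote_service)
print(per_conn, can_send("fetch_shuffle_blocks", per_conn))  # 1 False
```

Taking the minimum of the two versions is the simplest negotiation rule: a newer client falls back to an older server's message set instead of sending messages the server cannot parse.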

This message was sent by Atlassian Jira
