spark-user mailing list archives

From ZHANG Wei <>
Subject Re: Cross Region Apache Spark Setup
Date Mon, 20 Apr 2020 14:39:23 GMT
There might be 3 options:

1. Just as you expect: only ONE application and ONE RDD, with regioned containers and executors automatically allocated and distributed. The ResourceProfile ( may meet the requirement, treating Region as a type of resource just like GPU. But you have to wait for the full feature to be finished, and I can imagine the troubleshooting challenges.
2. Label the YARN nodes with a region tag, group them into queues, and submit the jobs for different regions into dedicated queues (with the --queue argument when submitting).
3. Build separate Spark clusters with an independent YARN ResourceManager per region, such as a UK cluster, a US-east cluster, and a US-west cluster. It looks dirty, but it is easy to deploy and manage, and you can schedule jobs around each region's busy and idle hours to get more performance and lower
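For option 2, the queue can also be set in the application itself rather than on the command line. A minimal sketch, assuming YARN queues named after the regions (queue and app names here are hypothetical examples, not from your setup):

```python
from pyspark.sql import SparkSession

# Pin this application to the hypothetical "us-west" YARN queue,
# whose nodes would carry the matching region node label.
spark = (SparkSession.builder
         .appName("sales-us-west")          # example app name
         .config("spark.yarn.queue", "us-west")
         .getOrCreate())

df = spark.read.parquet("s3://mybucket/sales_fact.parquet/us-west/*")
```

You would submit one such application per region; each reads only its own region's partition of the data.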

Just my 2 cents


From: Stone Zhong <>
Sent: Wednesday, April 15, 2020 4:31
Subject: Cross Region Apache Spark Setup


I am trying to set up a cross-region Apache Spark cluster. All my data is stored in Amazon
S3 and well partitioned by region.

For example, I have a parquet file at

And my cluster has nodes in the us-west, us-east and uk regions -- basically I have nodes
in all the regions that I support.

When I have code like:

df = spark.read.parquet("s3://mybucket/sales_fact.parquet/*")
print(df.count())  # 1
print("product_id").distinct().count())  # 2

For #1, I expect the us-west nodes to read only the data partitions in us-west (and likewise
for the other regions), and Spark to add up the 3 regional counts and return me the total
count. I do not expect large cross-region data transfer in this case.
For #2, I likewise expect each region's nodes to read only their local partitions, run the
distinct() locally first, then merge the 3 "product_id" lists and run a distinct() again.
I am OK with the necessary cross-region data transfer for merging the distinct product_ids.
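The merge you describe for #2 is essentially what distinct() already does: a partial deduplication per partition, then a shuffle that only moves the already-deduplicated keys. A plain-Python sketch of the idea (the region data below is made up purely for illustration):

```python
# Simulate per-region partial distinct followed by a global merge,
# mirroring Spark's map-side partial aggregation before the shuffle.
us_west = ["p1", "p2", "p2", "p3"]   # illustrative product_ids
us_east = ["p2", "p4"]
uk      = ["p1", "p5", "p5"]

# Step 1: each region deduplicates locally, no cross-region traffic.
partials = [set(us_west), set(us_east), set(uk)]

# Step 2: only the small distinct sets cross regions and are merged.
merged = set().union(*partials)
print(len(merged))  # 5 distinct product_ids
```

Whether the step-1 work actually stays region-local, though, depends on the scheduler placing tasks on executors near the data, which is the part the options above address.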

Can anyone please share the best practice? Is it possible to configure Apache Spark to work
in such a way?

Any idea and help is appreciated!


