spark-user mailing list archives

From ZHANG Wei <wezh...@outlook.com>
Subject Re: Cross Region Apache Spark Setup
Date Mon, 20 Apr 2020 14:39:23 GMT
I can think of three options:

1. Just as you expect: only ONE application, ONE RDD, with region-aware containers and executors
allocated and distributed automatically. The ResourceProfile work (https://issues.apache.org/jira/browse/SPARK-27495)
may meet this requirement by treating the region as a type of resource, just like a GPU. But you
would have to wait for the feature to be fully finished, and I can imagine the troubleshooting
challenges. A sketch of the API shape follows after this list.
2. Label the YARN nodes with a region tag, group them into queues, and submit the jobs for
different regions into dedicated queues (with the --queue argument when submitting); see the
second sketch after this list.
3. Build separate Spark clusters with independent YARN resource managers per region, such
as a UK cluster, a US-east cluster, and a US-west cluster. It looks dirty, but it is easy to
deploy and manage, and you can schedule jobs around each region's busy and idle hours to get
better performance and lower cost.
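
To make option 1 concrete, here is a minimal sketch of the stage-level scheduling API that
SPARK-27495 introduces (PySpark 3.1+ shape). Note the assumptions: "region" is a HYPOTHETICAL
resource type of my own invention; Spark's custom resources are designed for accelerators such
as GPUs with discovery scripts, so modelling a region this way is speculation, and stage-level
scheduling also expects dynamic allocation to be enabled on YARN.

from pyspark.sql import SparkSession
from pyspark.resource import (ResourceProfileBuilder,
                              ExecutorResourceRequests,
                              TaskResourceRequests)

spark = SparkSession.builder.appName("region-profile-sketch").getOrCreate()

# HYPOTHETICAL "region" resource: one per executor, one consumed per task.
# This only shows the shape of the SPARK-27495 API, not a supported setup.
ereqs = ExecutorResourceRequests().cores(4).resource("region", 1)
treqs = TaskResourceRequests().cpus(1).resource("region", 1)
profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

# Pin a stage to the profile, so it runs only on executors that
# advertise the "region" resource. (Illustrative input path.)
rdd = spark.sparkContext.textFile("s3://mybucket/some_text_input")
print(rdd.withResources(profile).count())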
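Option 2 needs no new Spark features, only YARN node labels plus capacity-scheduler queues. A
sketch for the us-west job, assuming the nodes have already been labelled "us-west" and a
matching queue exists (the queue and label names are illustrative):

from pyspark.sql import SparkSession

# Submit each regional job into its region's queue and pin it to the
# matching node label. Equivalent spark-submit flags:
#   --queue us-west
#   --conf spark.yarn.am.nodeLabelExpression=us-west
#   --conf spark.yarn.executor.nodeLabelExpression=us-west
spark = (SparkSession.builder
         .appName("sales-us-west")
         .config("spark.yarn.queue", "us-west")
         .config("spark.yarn.am.nodeLabelExpression", "us-west")
         .config("spark.yarn.executor.nodeLabelExpression", "us-west")
         .getOrCreate())

# Each regional job reads only its own S3 partition, so the scan stays
# region-local; merge the per-region results afterwards.
df = spark.read.parquet("s3://mybucket/sales_fact.parquet/us-west")
print(df.count())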

Just my 2 cents

---
Cheers,
-z

________________________________________
From: Stone Zhong <stone.zhong@gmail.com>
Sent: Wednesday, April 15, 2020 4:31
To: user@spark.apache.org
Subject: Cross Region Apache Spark Setup

Hi,

I am trying to set up a cross-region Apache Spark cluster. All my data is stored in Amazon
S3 and well partitioned by region.

For example, I have Parquet files at
    s3://mybucket/sales_fact.parquet/us-west
    s3://mybucket/sales_fact.parquet/us-east
    s3://mybucket/sales_fact.parquet/uk

And my cluster has nodes in the us-west, us-east and uk regions -- basically, I have nodes
in every region that I support.

When I have code like:

df = spark.read.parquet("s3://mybucket/sales_fact.parquet/*")
print(df.count()) #1
print(df.select("product_id").distinct().count()) #2

For #1, I expect only the us-west nodes to read the data partition in us-west (and likewise
for the other regions), and Spark to add up the 3 regional counts and return a total count.
I do not expect large cross-region data transfer in this case.
For #2, I expect only the us-west nodes to read the data partition in us-west (and likewise
for the other regions). Each region would do the distinct() locally first, then the 3
"product_id" lists would be merged and deduplicated with another distinct(). I am OK with the
necessary cross-region data transfer for merging the distinct product_ids; see the plan
sketch below.
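
Looking at the physical plan, distinct() already runs as a two-phase aggregate (a partial
distinct per task, a shuffle, then a final distinct); my question is really whether the
partial phase can be kept region-local. The plan shape can be checked with explain()
(output abridged and illustrative, details vary by Spark version):

df.select("product_id").distinct().explain()

# Abridged physical plan:
#   HashAggregate(keys=[product_id])         <- final distinct, after shuffle
#   +- Exchange hashpartitioning(product_id, 200)
#      +- HashAggregate(keys=[product_id])   <- partial distinct, per task
#         +- FileScan parquet s3://mybucket/sales_fact.parquet/...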

Can anyone please share the best practice? Is it possible to configure Apache Spark to work
in such a way?

Any ideas and help are appreciated!

Thanks,
Stone

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

