spark-user mailing list archives

From Yong Zhang <java8...@hotmail.com>
Subject Re: Tuning level of Parallelism: Increase or decrease?
Date Wed, 03 Aug 2016 11:01:39 GMT
Data locality is part of the job/task scheduler's responsibility. So both links you originally cited are correct: one covers the standalone mode that ships with Spark, the other covers YARN. Both support locality-aware scheduling.


But YARN, as a very popular resource-scheduling component, comes with many more features than the standalone mode. You can find plenty of material about it online.
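Whichever cluster manager you use, you can also tune how hard the scheduler tries for locality before giving up. A sketch of the relevant settings (these are the standard spark.locality.* keys; the values shown are Spark's documented defaults, and the per-level keys fall back to spark.locality.wait when unset):

```properties
# How long to wait for a free slot at the preferred locality level
# before relaxing to the next, less-local level.
spark.locality.wait          3s
spark.locality.wait.process  3s
spark.locality.wait.node     3s
spark.locality.wait.rack     3s
```

Raising these can improve locality on a busy cluster at the cost of scheduling latency; setting them to 0 disables the wait entirely.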


Yong


________________________________
From: Jestin Ma <jestinwith.an.e@gmail.com>
Sent: Tuesday, August 2, 2016 7:11 PM
To: Jacek Laskowski
Cc: Nikolay Zhebet; Andrew Ehrlich; user
Subject: Re: Tuning level of Parallelism: Increase or decrease?

Hi Jacek,
I found this page of your book here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html


which says: "It is therefore important to have Spark running on Hadoop YARN cluster if the data comes from HDFS. In Spark on YARN Spark tries to place tasks alongside HDFS blocks."
(https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)


So my reasoning was: since Spark already takes care of data locality when workers load data from
HDFS, I can't see why running on YARN would matter more.

Hope this makes my question clearer.
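For what it's worth, the mechanism both cluster managers rely on is the same delay-scheduling idea: a task waits for a slot at its preferred locality level, and only after waiting long enough does the scheduler relax to a less local level. A toy Python sketch of that idea (names and structure are illustrative only, not Spark's internals; the 3000 ms default mirrors spark.locality.wait):

```python
# Toy sketch of delay scheduling (illustrative only, not Spark's code).
# Locality levels, from most to least preferred.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]

def pick_level(free_slots, waited_ms, wait_per_level_ms=3000):
    """Return the most local level we are currently allowed to use.

    free_slots: dict level -> bool, whether a free slot exists there.
    waited_ms: how long this task has been waiting for a slot.
    """
    # Every wait_per_level_ms of waiting unlocks one less-local level.
    allowed = min(waited_ms // wait_per_level_ms, len(LEVELS) - 1)
    for i, level in enumerate(LEVELS):
        if i > allowed:
            break
        if free_slots.get(level):
            return level
    return None  # no acceptable slot yet: keep waiting

print(pick_level({"PROCESS_LOCAL": True}, 0))   # PROCESS_LOCAL
print(pick_level({"NODE_LOCAL": True}, 0))      # None (still hoping for process-local)
print(pick_level({"NODE_LOCAL": True}, 3000))   # NODE_LOCAL
print(pick_level({"ANY": True}, 9000))          # ANY
```

This is why locality behaves the same conceptually under standalone and YARN: the waiting logic lives in Spark's task scheduler, while the cluster manager only decides where executors run in the first place.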


On Tue, Aug 2, 2016 at 3:54 PM, Jacek Laskowski <jacek@japila.pl<mailto:jacek@japila.pl>>
wrote:
On Mon, Aug 1, 2016 at 5:56 PM, Jestin Ma <jestinwith.an.e@gmail.com<mailto:jestinwith.an.e@gmail.com>>
wrote:
> Hi Nikolay, I'm looking at data locality improvements for Spark, and I have
> conflicting sources on using YARN for Spark.
>
> Reynold said that Spark workers automatically take care of data locality
> here:
> https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
>
> However, I've read elsewhere
> (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
> that Spark on YARN increases data locality because YARN tries to place tasks
> next to HDFS blocks.
>
> Can anyone verify/support one side or the other?

Hi Jestin,

I'm the author of the latter. I can't see how Reynold's answer "conflicts"
with what I wrote in the notes. Could you elaborate?

I certainly may be wrong.

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

