spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jestin Ma <>
Subject Re: Tuning level of Parallelism: Increase or decrease?
Date Mon, 01 Aug 2016 15:56:54 GMT
Hi Nikolay, I'm looking at data locality improvements for Spark, and I have
conflicting sources on using YARN for Spark.

Reynold said that Spark workers automatically take care of data locality

However, I've read elsewhere (
that Spark on YARN increases data locality because YARN tries to place
tasks next to HDFS blocks.

Can anyone verify/support one side or the other?

Thank you,

On Mon, Aug 1, 2016 at 1:15 AM, Nikolay Zhebet <> wrote:

> Hi.
> Maybe you can help "data locality"..
> If you use groupBY and joins, than most likely you will see alot of
> network operations. This can be werry slow. You can try prepare, transform
> your information in that way, what can minimize transporting temporary
> information between worker-nodes.
> Try google in this way "Data locality in Hadoop"
> 2016-08-01 4:41 GMT+03:00 Jestin Ma <>:
>> It seems that the number of tasks being this large do not matter. Each
>> task was set default by the HDFS as 128 MB (block size) which I've heard to
>> be ok. I've tried tuning the block (task) size to be larger and smaller to
>> no avail.
>> I tried coalescing to 50 but that introduced large data skew and slowed
>> down my job a lot.
>> On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich <>
>> wrote:
>>> 15000 seems like a lot of tasks for that size. Test it out with a
>>> .coalesce(50) placed right after loading the data. It will probably either
>>> run faster or crash with out of memory errors.
>>> On Jul 29, 2016, at 9:02 AM, Jestin Ma <>
>>> wrote:
>>> I am processing ~2 TB of hdfs data using DataFrames. The size of a task
>>> is equal to the block size specified by hdfs, which happens to be 128 MB,
>>> leading to about 15000 tasks.
>>> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
>>> I'm performing groupBy, count, and an outer-join with another DataFrame
>>> of ~200 MB size (~80 MB cached but I don't need to cache it), then saving
>>> to disk.
>>> Right now it takes about 55 minutes, and I've been trying to tune it.
>>> I read on the Spark Tuning guide that:
>>> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>>> This means that I should have about 30-50 tasks instead of 15000, and
>>> each task would be much bigger in size. Is my understanding correct, and is
>>> this suggested? I've read from difference sources to decrease or increase
>>> parallelism, or even keep it default.
>>> Thank you for your help,
>>> Jestin

View raw message