spark-user mailing list archives

From Jestin Ma <jestinwith.a...@gmail.com>
Subject Re: Tuning level of Parallelism: Increase or decrease?
Date Mon, 01 Aug 2016 01:41:10 GMT
It seems that having this many tasks doesn't matter much by itself. Each task's
input defaults to the HDFS block size of 128 MB, which I've heard is fine. I've
tried tuning the block (task) size both larger and smaller, to no avail.

I tried coalescing to 50 but that introduced large data skew and slowed
down my job a lot.
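
For reference, the coalesce attempt was essentially the following (a minimal
sketch assuming Spark 2.x and the Scala DataFrame API with Parquet input; the
path and names are placeholders, and `spark` is the SparkSession):

    // Hypothetical illustration: coalesce placed right after loading, before
    // the groupBy/join. coalesce(50) merges the ~15000 input partitions down
    // to 50 without a shuffle; in my case this introduced heavy data skew.
    val df = spark.read.parquet("hdfs:///path/to/input")  // ~2 TB of HDFS data
      .coalesce(50)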

On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich <andrew@aehrlich.com> wrote:

> 15000 seems like a lot of tasks for that size. Test it out with a
> .coalesce(50) placed right after loading the data. It will probably either
> run faster or crash with out of memory errors.
>
> On Jul 29, 2016, at 9:02 AM, Jestin Ma <jestinwith.an.e@gmail.com> wrote:
>
> I am processing ~2 TB of HDFS data using DataFrames. The size of a task is
> equal to the HDFS block size, which happens to be 128 MB, leading to about
> 15000 tasks.
>
> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
> I'm performing a groupBy, a count, and an outer join with another DataFrame of
> ~200 MB (~80 MB cached, though I don't need to cache it), then saving to disk.
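>
> In code, the job is roughly the following (a minimal sketch assuming Spark 2.x
> and the Scala DataFrame API with Parquet input; the paths, the join column
> "key", and the variable names are placeholders, not the actual job):
>
>     // Hypothetical illustration only.
>     import org.apache.spark.sql.SparkSession
>     val spark = SparkSession.builder().appName("groupby-join-job").getOrCreate()
>
>     // Load the ~2 TB input; the partition count follows the 128 MB HDFS
>     // block size, giving roughly 15000 tasks.
>     val large = spark.read.parquet("hdfs:///path/to/large/input")
>     val small = spark.read.parquet("hdfs:///path/to/small/input")  // ~200 MB
>
>     // groupBy + count on the large DataFrame, then an outer join with the
>     // small one, and save the result to disk.
>     val counts = large.groupBy("key").count()
>     val result = counts.join(small, Seq("key"), "outer")
>     result.write.parquet("hdfs:///path/to/output")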
>
> Right now it takes about 55 minutes, and I've been trying to tune it.
>
> I read in the Spark Tuning Guide that:
> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>
> This means that I should have about 30-50 tasks instead of 15000, and each
> task would be much larger. Is my understanding correct, and is this the
> suggested approach? I've read from different sources that I should decrease or
> increase parallelism, or even keep it at the default.
>
> Thank you for your help,
> Jestin
>
