spark-user mailing list archives

From Andrew Ehrlich <and...@aehrlich.com>
Subject Re: Tuning level of Parallelism: Increase or decrease?
Date Mon, 01 Aug 2016 00:27:49 GMT
15000 seems like a lot of tasks for that data size. Test it out with a .coalesce(50) placed right
after loading the data; it will probably either run faster or crash with out-of-memory errors.
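
For reference, a rough sketch of what that could look like, assuming Spark 2.x, Parquet input,
and a placeholder group/join column named "key" (your actual paths, input format, and column
names will differ):

import org.apache.spark.sql.SparkSession

object CoalesceTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CoalesceTest").getOrCreate()

    // Load the ~2 TB dataset, then immediately drop to 50 partitions so the
    // downstream stages run ~50 larger tasks instead of ~15000 block-sized ones.
    val big = spark.read.parquet("hdfs:///path/to/big-input").coalesce(50)

    // The smaller (~200 MB) DataFrame to outer-join against.
    val small = spark.read.parquet("hdfs:///path/to/small-input")

    big.groupBy("key").count()
      .join(small, Seq("key"), "outer")
      .write.parquet("hdfs:///path/to/output")

    spark.stop()
  }
}

coalesce avoids a full shuffle when reducing the partition count, which is why it's cheap to
try right after the load; if the 50 resulting partitions are too big to fit in executor memory,
that's where the out-of-memory errors would show up.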

> On Jul 29, 2016, at 9:02 AM, Jestin Ma <jestinwith.an.e@gmail.com> wrote:
> 
> I am processing ~2 TB of HDFS data using DataFrames. The size of a task is equal to the
> block size specified by HDFS, which happens to be 128 MB, leading to about 15000 tasks.
> 
> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
> I'm performing groupBy, count, and an outer-join with another DataFrame of ~200 MB size
> (~80 MB cached, but I don't need to cache it), then saving to disk.
> 
> Right now it takes about 55 minutes, and I've been trying to tune it.
> 
> I read on the Spark Tuning guide that:
> In general, we recommend 2-3 tasks per CPU core in your cluster.
> 
> This means that I should have about 30-50 tasks instead of 15000, and each task would
> be much bigger in size. Is my understanding correct, and is this suggested? I've read from
> different sources to decrease or increase parallelism, or even to keep the default.
> 
> Thank you for your help,
> Jestin

