spark-user mailing list archives

From Hemant Bhanawat <>
Subject Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data
Date Thu, 20 Aug 2015 09:13:38 GMT
It looks like you are using hash-based shuffle rather than sort-based
shuffle. Sort-based shuffle creates a single file per map task.
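
To make the difference concrete, here is a rough sketch of how many shuffle data files each approach writes on the map side. The function name and the task counts are illustrative assumptions, not figures from your job, and the hash case assumes file consolidation is off. In Spark 1.x the implementation is selected with the spark.shuffle.manager setting ("hash" or "sort").

```python
def shuffle_data_files(num_map_tasks, num_reduce_partitions, manager):
    """Rough count of shuffle data files written by the map side.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    if manager == "hash":
        # Hash-based shuffle (consolidation off): every map task writes
        # one file for every reduce partition.
        return num_map_tasks * num_reduce_partitions
    if manager == "sort":
        # Sort-based shuffle: every map task writes one sorted data file
        # (the small index file that accompanies it is not counted here).
        return num_map_tasks
    raise ValueError("manager must be 'hash' or 'sort'")


# Assumed numbers: 1000 map tasks, 200 reduce partitions.
print(shuffle_data_files(1000, 200, "hash"))  # 200000
print(shuffle_data_files(1000, 200, "sort"))  # 1000
```

Note that these are intermediate shuffle files on local disk, which are separate from the final output files written to HDFS.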

On Thu, Aug 20, 2015 at 12:43 AM, unk1102 <> wrote:

> Hi, I have a Spark job that deals with a large, skewed dataset. I have
> around 1000 Hive partitions to process in four different tables every
> day. If I go with the default of 200 spark.sql.shuffle.partitions, I end
> up with 4 * 1000 * 200 = 800000 small files in HDFS, which won't be good
> for the HDFS NameNode. I have been told that if you keep creating such a
> large number of small files the NameNode will crash. Is that true? Please
> help me understand. To avoid creating small files I set
> spark.sql.shuffle.partitions=1, which seems to create one output file.
> As I understand it, with only one output partition there is a great deal
> of shuffling to do to bring all the data to one reducer. Please correct
> me if I am wrong. This is causing memory/timeout issues. How do I deal
> with it?
> I tried to give also still this memory seems not
> enough for it. I have 20 executors, each with 25 GB of memory and 4
> cores, and the Spark job still fails. Please guide.
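
For reference, the file-count estimate in the quoted message can be worked through as below. This is only a sketch of that arithmetic: it assumes, as the quoted message does, that each Hive partition of each table ends up with one output file per shuffle partition, and the function name is hypothetical.

```python
def estimated_output_files(tables, hive_partitions, shuffle_partitions):
    """Estimate of daily output files under the quoted message's assumption:
    one file per (table, Hive partition, shuffle partition)."""
    return tables * hive_partitions * shuffle_partitions


# Default of 200 shuffle partitions, 4 tables, about 1000 Hive partitions.
print(estimated_output_files(4, 1000, 200))  # 800000

# With spark.sql.shuffle.partitions=1 the same estimate drops to one file
# per Hive partition per table, at the cost of a single reducer.
print(estimated_output_files(4, 1000, 1))  # 4000
```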