spark-user mailing list archives

From Hemant Bhanawat <hemant9...@gmail.com>
Subject Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data
Date Thu, 20 Aug 2015 12:29:47 GMT
Sorry, I misread your mail. Thanks for pointing that out.

BTW, are those small files shuffle intermediate output rather than the
final output? I assume so. I didn't know that you could keep intermediate
output on HDFS, and I don't think that is recommended.
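
To make the file-count arithmetic in this thread concrete, here is a rough
sketch in plain Python (no Spark needed). The function name and the map-task
count are hypothetical and only for illustration. It assumes hash-based
shuffle without file consolidation writes one file per map task per reduce
partition, and that sort-based shuffle writes one data file per map task
(index files are not counted). The last lines redo the estimate from the
original question.

```python
def shuffle_file_count(map_tasks: int, reduce_partitions: int, manager: str) -> int:
    """Rough number of shuffle data files written by one shuffle stage.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    if manager == "hash":
        # Hash-based shuffle (no consolidation): every map task writes
        # one file for every reduce partition.
        return map_tasks * reduce_partitions
    if manager == "sort":
        # Sort-based shuffle: every map task writes a single data file
        # (the accompanying index file is not counted here).
        return map_tasks
    raise ValueError(f"unknown shuffle manager: {manager!r}")


# Illustrative numbers only: 1000 map tasks is an assumption, 200 is the
# default value of spark.sql.shuffle.partitions mentioned in the thread.
hash_files = shuffle_file_count(1000, 200, "hash")
sort_files = shuffle_file_count(1000, 200, "sort")
print(hash_files, sort_files)

# Estimate from the original question:
# 4 tables x 1000 Hive partitions x 200 shuffle partitions.
tables, hive_partitions, shuffle_partitions = 4, 1000, 200
estimated_files = tables * hive_partitions * shuffle_partitions
print(estimated_files)
```

Under these assumptions the file count with hash-based shuffle grows with
the number of reduce partitions, while with sort-based shuffle it depends
only on the number of map tasks.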




On Thu, Aug 20, 2015 at 2:43 PM, Hemant Bhanawat <hemant9379@gmail.com>
wrote:

> Looks like you are using hash-based shuffling and not sort-based
> shuffling; sort-based shuffling creates a single file per map task.
>
> On Thu, Aug 20, 2015 at 12:43 AM, unk1102 <umesh.kacha@gmail.com> wrote:
>
>> Hi, I have a Spark job which deals with a large, skewed dataset. I have
>> around 1000 Hive partitions to process in four different tables every
>> day. So if I go with the default of 200 for
>> spark.sql.shuffle.partitions, I end up with 4 * 1000 * 200 = 800,000
>> small files in HDFS, which won't be good for the HDFS NameNode. I have
>> been told that if you keep creating such a large number of small files
>> the NameNode will crash. Is that true? Please help me understand.
>>
>> Anyway, to avoid creating small files I set
>> spark.sql.shuffle.partitions=1. It seems to create one output file, and
>> as per my understanding, because there is only one output, a lot of
>> shuffling is needed to bring all the data to one reducer. Please correct
>> me if I am wrong. This is causing memory/timeout issues. How do I deal
>> with it?
>>
>> I also tried setting spark.shuffle.storage=0.7, but this memory still
>> does not seem to be enough. I have 20 executors, each with 25 GB and
>> 4 cores, and the Spark job still fails. Please guide.
