Chetan,

When you're writing to a partitioned table, you want to use a shuffle to avoid the situation where each task has to write to every partition. You can do that either by adding a repartition by your table's partition keys, or by adding an order by with the partition keys followed by the columns you normally use to filter when reading the table. I generally recommend the second approach because it handles skew and prepares the data for more efficient reads.
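A minimal sketch of both approaches, assuming a DataFrame read from a source table, a partition column named `date`, and a commonly filtered column named `id` (all names here are illustrative, not from your job):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val df = spark.table("source_table") // hypothetical source

// Approach 1: repartition by the table's partition keys, so each task
// writes to only one (or a few) partitions instead of all of them.
df.repartition(col("date"))
  .write.mode("overwrite")
  .insertInto("target_table")

// Approach 2 (recommended): a global sort by the partition keys followed
// by the columns you usually filter on. The range shuffle handles skew,
// and the data lands clustered for more efficient reads.
df.orderBy(col("date"), col("id"))
  .write.mode("overwrite")
  .insertInto("target_table")
```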

If that doesn't help, then you should look at your memory settings. When executors are getting killed by YARN, consider setting `spark.shuffle.io.preferDirectBufs=false` so Spark uses less of the off-heap memory that the JVM doesn't account for. That is usually an easier fix than increasing the memory overhead. Also, when you change executor memory, adjust spark.memory.fraction so the memory you're adding is used where it's needed. If your memory fraction is the default 60%, then 60% of any memory you add will be used for Spark execution, not left for whatever is consuming it and causing the OOM. (If Spark's memory is too low, you'll see other problems, like spilling too much to disk.)
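For example, the relevant settings on a submit command might look like this (the specific values are illustrative, not a recommendation for your cluster):

```shell
spark-submit \
  --executor-memory 12g \
  --conf spark.shuffle.io.preferDirectBufs=false \
  --conf spark.memory.fraction=0.6 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  your-job.jar
```

The point is that these settings interact: raising executor memory without touching spark.memory.fraction means most of the new memory goes to Spark execution rather than to whatever is actually running out.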

rb

On Wed, Aug 2, 2017 at 9:02 AM, Chetan Khatri <chetan.opensource@gmail.com> wrote:
Can anyone please guide me with the above issue?


On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri <chetan.opensource@gmail.com> wrote:
Hello Spark Users,

I am reading from an HBase table and writing to a Hive managed table, where I applied partitioning by a date column. That worked fine, but it generated a large number of files across almost 700 partitions, so I wanted to use repartition to reduce file I/O by reducing the number of files inside each partition.

But I ended up with the below exception:

ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 14.0 GB of 14 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 

Driver memory=4g, executor memory=12g, num-executors=8, executor cores=8

Do you think the below settings can help me overcome the above issue:

spark.default.parallelism=1000
spark.sql.shuffle.partitions=1000

Because the default max number of partitions is 1000.






--
Ryan Blue
Software Engineer
Netflix