spark-user mailing list archives

From Shivam Sharma
Subject GC overhead while read a table partition from HIVE
Date Thu, 16 May 2019 13:08:01 GMT
Hi All,

I am getting a GC overhead error while reading a Hive table from Spark with a
query like:

spark.sql("SELECT * FROM some.table where date='2019-05-14' LIMIT 10").show()

When I run the above command in spark-shell, it starts processing *1780
tasks* and goes OOM at a specific partition.

1. The table partition (*date='2019-05-14'*) has *4000* files on HDFS, so
ideally 4000 partitions should be created inside the Spark DataFrame, if I am
not wrong. I analyzed the table, and it actually has a total of *1780*
partitions (meaning 1780 date folders).

2. I checked the size of the files in the table partition
(*date='2019-05-14'*): the max file size is *1.1 GB*, and I have given *7 GB*
to each executor, so if I am right above, it should not throw OOM.

3. When I add *LIMIT 10*, does spark-hive still read all the files?
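
For reference, a minimal sketch of how the partition count and the query plan
could be inspected from spark-shell. It assumes the `spark` session that
spark-shell predefines and the placeholder table name from the query above;
`df` is only an illustrative name, and the output will depend on the table's
storage format and the Spark version.

```scala
// Run inside spark-shell, where `spark` (a SparkSession) is already defined.
// `df` is an illustrative name; the table and date come from the query above.
val df = spark.sql("SELECT * FROM some.table WHERE date='2019-05-14'")

// Number of partitions Spark plans for this read, to compare against the
// 4000 files in the date folder and the 1780 tasks that were observed.
println(df.rdd.getNumPartitions)

// Print the logical and physical plans to see how the LIMIT is planned and
// whether the date filter reaches the table scan as a partition filter.
df.limit(10).explain(true)
```

Neither call runs the full query: `getNumPartitions` only needs the planned
input partitions, and `explain(true)` only prints the plans.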


Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing
