spark-dev mailing list archives

From Shivam Sharma <28shivamsha...@gmail.com>
Subject Out Of Memory while reading a table partition from HIVE
Date Fri, 17 May 2019 15:23:25 GMT
Hi All,

I am getting an Out Of Memory error due to GC overhead while reading a Hive
table from Spark with a query like:

spark.sql("SELECT * FROM some.table where date='2019-05-14' LIMIT 10").show()


When I run the above command in spark-shell, it starts processing *1780
tasks* and goes OOM at a specific partition.

1. The table partition (*date='2019-05-14'*) has *4000* files on HDFS, so
ideally 4000 partitions should be created in the Spark DataFrame, if I am
not wrong. I checked the table, and it actually has *1780* partitions in
total (that is, 1780 date folders).

2. I checked the file sizes in the table partition (*date='2019-05-14'*): the
max file size is *1.1 GB*, and I have given *7 GB* to each executor, so if my
understanding above is right, it should not throw OOM.

3. And when I put *LIMIT 10* in the query, does spark-hive still read all the files?
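
The three points above could be checked from spark-shell with something like
the sketch below. It assumes the `spark` session that spark-shell provides and
the same some.table and date value as in the query above; it only prints what
Spark plans to do and is not a fix for the OOM.

```scala
// Sketch for spark-shell: `spark` is the SparkSession the shell creates.
// some.table and the date value are the ones from the query above.

// Same filter without the LIMIT, so the planned input partitions are visible.
val df = spark.sql("SELECT * FROM some.table where date='2019-05-14'")

// Number of partitions (and so tasks) Spark plans for this read.
// If partition pruning applies, this should reflect only the files under
// date='2019-05-14', not all 1780 date folders.
println(df.rdd.getNumPartitions)

// Print the parsed, analyzed, optimized and physical plans for the LIMIT query,
// to see whether the date filter and the limit reach the table scan.
spark.sql("SELECT * FROM some.table where date='2019-05-14' LIMIT 10").explain(true)
```

Comparing the printed partition count with the 1780 tasks seen in the shell
would show whether the read is planned per date folder or per file.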

Thanks

-- 
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing
Jabalpur
Email:- 28shivamsharma@gmail.com
LinkedIn:- https://www.linkedin.com/in/28shivamsharma
