spark-user mailing list archives

From: Felix Cheung <>
Subject: Re: Persist Dataframe to HDFS considering HDFS Block Size.
Date: Sat, 19 Jan 2019 19:05:48 GMT
You can call coalesce to combine partitions.
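
A minimal sketch of what that could look like, assuming an existing Hive-enabled SparkSession; the table names and the target partition count of 8 are hypothetical placeholders, not values from this thread:

    // Hedged sketch: coalesce the DataFrame down to a small number of
    // partitions before the write, so Spark produces that many output
    // files instead of one file per shuffle partition.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("coalesce-before-write")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.table("source_table")   // hypothetical source table

    df.coalesce(8)                          // combine the 400 shuffle partitions into ~8
      .write
      .mode("append")
      .insertInto("target_hive_table")      // hypothetical target Hive table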

From: Shivam Sharma <>
Sent: Saturday, January 19, 2019 7:43 AM
Subject: Persist Dataframe to HDFS considering HDFS Block Size.

Hi All,

I want to persist a DataFrame to HDFS. Basically, I am inserting data into a Hive table using
Spark. Currently, at the time of writing to the Hive table, I have set the total shuffle partitions
to 400, so 400 files are created, which does not take the HDFS block size into account.
How can I tell Spark to persist according to the HDFS block size?
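
For reference, the current write is roughly the following, assuming an existing SparkSession named spark; the DataFrame and table names are placeholders:

    // Rough illustration of the current setup: with 400 shuffle
    // partitions, the final stage runs 400 tasks and each task writes
    // its own file into the Hive table.
    spark.conf.set("spark.sql.shuffle.partitions", "400")

    resultDf                          // placeholder: DataFrame produced by a shuffle
      .write
      .mode("append")
      .insertInto("my_hive_table")    // ends up as roughly 400 small files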

We have something like this in Hive which solves this problem:

set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=2048000000;
set hive.merge.size.per.task=4096000000;


Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
Mobile No- (+91) 8882114744
