We are using the DataFrame `insertInto` method to write to an object-store-backed Hive table on Google Cloud (GCS), and we have observed that this approach is slow.
From what we have read, writes to Hive tables in Spark happen in two phases:
- Step 1 – DistributedWrite: Data is written to a Hive staging directory by the OutputCommitter. This step runs in a distributed manner across multiple executors.
- Step 2 – FinalCopy: After the new files are written to the Hive staging directory, they are moved to the final location using the FileSystem `rename` API. Unfortunately, this step runs serially on the driver, and on object stores like GCS a rename is not an atomic metadata operation but a copy-and-delete per file, which makes it especially slow. As part of this step, the metastore is also updated with the new partition information.
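Two settings that are commonly tuned for this commit path are sketched below. Whether they take effect depends on the Spark version and on whether the write goes through the Hive or the DataSource path, so treat this as a starting point rather than a fix:

```
# spark-defaults.conf (sketch; verify against your Spark version)

# v2 of the FileOutputCommitter moves task output into place as each
# task commits, instead of one serial job-commit rename on the driver.
# It trades some failure atomicity for fewer renames.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2

# During INSERT OVERWRITE, rewrite only the partitions that receive
# data instead of truncating the whole table.
spark.sql.sources.partitionOverwriteMode  dynamic
```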
We are considering writing the data directly to the final path and then programmatically adding the partitions (or running `MSCK REPAIR TABLE`) to save the cost of the rename operation. Are there other, more elegant ways to eliminate the FinalCopy (rename) step? Any suggestions to speed up this write would be appreciated.
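One way to make the "write directly, then register partitions" idea cheaper is to add only the partitions just written via `ALTER TABLE ... ADD PARTITION`, rather than running `MSCK REPAIR TABLE`, which scans the entire table location. A minimal plain-Python sketch of generating that DDL — the table name, partition column, and path layout are hypothetical:

```python
def add_partition_ddl(table, partitions, base_path):
    """Build one ALTER TABLE statement registering the given partitions.

    `partitions` is a list of dicts mapping partition column -> value,
    e.g. [{"dt": "2023-01-01"}, {"dt": "2023-01-02"}].
    """
    clauses = []
    for part in partitions:
        spec = ", ".join(f"{col}='{val}'" for col, val in part.items())
        path = "/".join(f"{col}={val}" for col, val in part.items())
        clauses.append(f"PARTITION ({spec}) LOCATION '{base_path}/{path}'")
    # IF NOT EXISTS makes the statement idempotent across retries.
    return f"ALTER TABLE {table} ADD IF NOT EXISTS " + " ".join(clauses)

# Example: register two daily partitions written directly to GCS
# (hypothetical table and bucket names).
ddl = add_partition_ddl(
    "mydb.events",
    [{"dt": "2023-01-01"}, {"dt": "2023-01-02"}],
    "gs://my-bucket/warehouse/events",
)
print(ddl)
```

The resulting statement can be run through `spark.sql(ddl)`; registering only the touched partitions keeps the metastore update proportional to the batch, not to the table.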
A few things to consider:
1. We receive both old and new data, so there will be new partitions as well as upserts into existing partitions.
2. Insert overwrite can happen into both static and dynamic partitions.
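Because each batch mixes new partitions with upserts to existing ones, it can help to first determine exactly which partitions the incoming data touches and overwrite only those (this is essentially what dynamic partition overwrite does). A plain-Python sketch of that grouping step, with hypothetical records and a single `dt` partition column:

```python
from collections import defaultdict

def partitions_touched(records, partition_cols):
    """Group incoming rows by their partition values.

    Returns {partition_tuple: row_count} so the job can overwrite only
    the partitions that actually received data, leaving the rest alone.
    """
    counts = defaultdict(int)
    for row in records:
        key = tuple(row[c] for c in partition_cols)
        counts[key] += 1
    return dict(counts)

# Hypothetical micro-batch: two rows upsert an old partition,
# one row creates a new one.
batch = [
    {"dt": "2023-01-01", "id": 1},
    {"dt": "2023-01-01", "id": 2},
    {"dt": "2023-01-05", "id": 3},
]
print(partitions_touched(batch, ["dt"]))
# {('2023-01-01',): 2, ('2023-01-05',): 1}
```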
Looking forward to a solution.