spark-user mailing list archives

From ehbhaskar <ehbhas...@gmail.com>
Subject [Spark SQL] INSERT OVERWRITE to a hive partitioned table (pointing to s3) from spark is too slow.
Date Mon, 05 Nov 2018 06:58:13 GMT
I have a PySpark job that inserts data into a Hive partitioned table using an
`INSERT OVERWRITE` statement.
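
Roughly, the job does the following (sketch only; the database, table, column,
and path names below are placeholders, not the real ones):

from pyspark.sql import SparkSession

# Sketch of the job shape described above; all names and paths are placeholders.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Dynamic partitioning has to be allowed for a PARTITION clause without values.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

# Source data registered as a temporary view (placeholder S3 path and schema).
spark.read.parquet("s3://my-bucket/input/").createOrReplaceTempView("staging")

# Dynamic-partition INSERT OVERWRITE into a Hive table whose location is on S3.
spark.sql("""
    INSERT OVERWRITE TABLE my_db.my_partitioned_table
    PARTITION (dt)
    SELECT col_a, col_b, dt
    FROM staging
""")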

The Spark job loads the data quickly (in about 15 minutes) into a temp directory
(~/.hive-***) in S3, but it is very slow moving the data from the temp directory
to the target path: that step alone takes more than 40 minutes.

I set mapreduce.fileoutputcommitter.algorithm.version=2 (the default is 1), but
I see no change.
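
For completeness, one way that option can be supplied to the session is through
the spark.hadoop.* prefix on the builder, which Spark copies into the Hadoop
configuration used by its file committers (sketch below; whether this actually
influences Hive's final move step is exactly what I'm unsure about):

from pyspark.sql import SparkSession

# Sketch only: pass the committer algorithm version through the session builder.
# Spark strips the "spark.hadoop." prefix and puts the rest into the Hadoop conf.
spark = (
    SparkSession.builder
    .enableHiveSupport()
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)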

*Are there any ways to improve the performance of a Hive INSERT OVERWRITE query
run from Spark?*

Also, I noticed that this behavior is even worse (i.e. the job takes even more
time) with a Hive table that already has many partitions; the data loads
relatively quickly into tables that have fewer existing partitions.

*Some additional details:*
* The table is dynamically partitioned.
* Spark version - 2.3.0
* Hive version - 2.3.2-amzn-2
* Hadoop version - 2.8.3-amzn-0

PS: Other config options I have tried that didn't have much effect on the job
performance (a sketch of one way to set them follows the list):
* hive.load.dynamic.partitions.thread = 10
* hive.mv.files.thread = 30
* fs.trash.interval = 0



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/


