spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: [Spark SQL] INSERT OVERWRITE to a hive partitioned table (pointing to s3) from spark is too slow.
Date Mon, 05 Nov 2018 07:08:25 GMT
Can you share some relevant source code?


> On 05.11.2018, at 07:58, ehbhaskar <ehbhaskar@gmail.com> wrote:
> 
> I have a pyspark job that inserts data into a Hive partitioned table using an
> `INSERT OVERWRITE` statement.
> 
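> A minimal sketch of the kind of statement involved (table and column names
> here are illustrative, not my real schema):
> 
>     from pyspark.sql import SparkSession
> 
>     spark = (SparkSession.builder
>              .appName("insert-overwrite-example")
>              .enableHiveSupport()  # required to write to Hive tables
>              .getOrCreate())
> 
>     # Dynamic-partition insert: the partition column (dt) is selected last.
>     spark.sql("""
>         INSERT OVERWRITE TABLE mydb.events PARTITION (dt)
>         SELECT id, payload, dt
>         FROM staging_events
>     """)
> 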
> The Spark job loads data quickly (in 15 mins) into a temp directory (~/.hive-***)
> in S3, but it is very slow at moving the data from the temp directory to the
> target path: that step alone takes more than 40 mins.
> 
> I set the option mapreduce.fileoutputcommitter.algorithm.version=2 (the default
> is 1), but I still see no change.
> 
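> For reference, a sketch of how I set it when building the session (the
> spark.hadoop.* prefix copies the setting into the Hadoop configuration; it
> could also go in spark-defaults.conf):
> 
>     spark = (SparkSession.builder
>              .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
>              .enableHiveSupport()
>              .getOrCreate())
> 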
> *Are there any ways to improve the performance of a Hive INSERT OVERWRITE
> query from Spark?*
> 
> Also, I noticed that this behavior is even worse (i.e. the job takes even more
> time) with a Hive table that has many existing partitions; the data loads
> relatively fast into tables that have fewer existing partitions.
> 
> *Some additional details:*
> * Table is a dynamic partitioned table (dynamic partitioning enabled as
> sketched after this list).
> * Spark version - 2.3.0
> * Hive version - 2.3.2-amzn-2
> * Hadoop version - 2.8.3-amzn-0
> 
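> For completeness, dynamic partitioning is enabled with the standard Hive
> settings (a sketch; without these, a fully dynamic PARTITION (dt) insert is
> rejected in strict mode):
> 
>     spark.conf.set("hive.exec.dynamic.partition", "true")
>     spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
> 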
> PS: Other config options I have tried that didn't have much effect on job
> performance (passed as sketched below):
> * "hive.load.dynamic.partitions.thread" - "10"
> * "hive.mv.files.thread" - "30"
> * "fs.trash.interval" - "0"
> 
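> A sketch of how these were passed (again via the spark.hadoop.* prefix so
> they reach the Hive/Hadoop configuration; whether each one is honored may
> depend on the Hive version):
> 
>     spark = (SparkSession.builder
>              .config("spark.hadoop.hive.load.dynamic.partitions.thread", "10")
>              .config("spark.hadoop.hive.mv.files.thread", "30")
>              .config("spark.hadoop.fs.trash.interval", "0")
>              .enableHiveSupport()
>              .getOrCreate())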
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

