spark-user mailing list archives

From Bhaskar Ebbur <ehbhas...@gmail.com>
Subject Re: [Spark SQL] INSERT OVERWRITE to a hive partitioned table (pointing to s3) from spark is too slow.
Date Mon, 05 Nov 2018 07:30:02 GMT
Here's some sample code.

from pyspark.sql import SparkSession

# Hive-enabled session with dynamic partitioning and the v2 file output committer.
self.session = SparkSession \
            .builder \
            .appName(self.app_name) \
            .config("spark.dynamicAllocation.enabled", "false") \
            .config("hive.exec.dynamic.partition.mode", "nonstrict") \
            .config("mapreduce.fileoutputcommitter.algorithm.version", "2") \
            .config("hive.load.dynamic.partitions.thread", "10") \
            .config("hive.mv.files.thread", "30") \
            .config("fs.trash.interval", "0") \
            .enableHiveSupport() \
            .getOrCreate()

columns_with_default = "col1, NULL as col2, col2, col4, NULL as col5,
partition_col1, partition_col2"
source_data_df_to_write = self.session.sql(
                 "SELECT %s, %s, %s as %s, %s as %s FROM TEMP_VIEW" %
(columns_with_default))
         source_data_df_to_write\
             .coalesce(50)\
             .createOrReplaceTempView("TEMP_VIEW")

table_name_abs = "%s.%s" % (self.database, self.target_table)
self.session.sql(
    "INSERT OVERWRITE TABLE %s "
    "PARTITION (%s) "
    "SELECT %s FROM TEMP_VIEW" % (
        table_name_abs, "partition_col1, partition_col2",
        columns_with_default))
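
For comparison, here is a rough sketch of the same overwrite done through the
DataFrame writer API instead of a SQL string. This is only a sketch, not what
the job above runs; it assumes the selected column order matches the target
table and that the partition columns come last.

# Sketch only: write the selected columns directly into the partitioned table.
# With hive.exec.dynamic.partition.mode=nonstrict, the trailing partition
# columns drive dynamic partitioning.
source_data_df_to_write = self.session.sql(
    "SELECT %s FROM TEMP_VIEW" % columns_with_default)
source_data_df_to_write \
    .coalesce(50) \
    .write \
    .insertInto(table_name_abs, overwrite=True)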


On Sun, Nov 4, 2018 at 11:08 PM Jörn Franke <jornfranke@gmail.com> wrote:

> Can you share some relevant source code?
>
>
> > On 05.11.2018 at 07:58, ehbhaskar <ehbhaskar@gmail.com> wrote:
> >
> > I have a PySpark job that inserts data into a Hive partitioned table using
> > an `INSERT OVERWRITE` statement.
> >
> > The Spark job loads data quickly (in ~15 mins) to a temp directory
> > (~/.hive-***) in S3, but it is very slow in moving the data from that
> > temp directory to the target path: the move alone takes more than 40 mins.
> >
> > I set the option mapreduce.fileoutputcommitter.algorithm.version=2 (the
> > default is 1), but I still see no change.
> >
> > *Are there any ways to improve the performance of a Hive INSERT OVERWRITE
> > query from Spark?*
> >
> > Also, I noticed that this behavior is even worse (i.e. the job takes even
> > more time) with a Hive table that has many existing partitions; data
> > loads relatively fast into tables that have fewer existing partitions.
> >
> > *Some additional details:*
> > * The table is dynamically partitioned.
> > * Spark version - 2.3.0
> > * Hive version - 2.3.2-amzn-2
> > * Hadoop version - 2.8.3-amzn-0
> >
> > PS: Other config options I have tried that didn't have much effect on the
> > job performance:
> > * "hive.load.dynamic.partitions.thread" - "10"
> > * "hive.mv.files.thread" - "30"
> > * "fs.trash.interval" - "0"
> >
> >
> >
