hive-issues mailing list archives

From "Rui Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15313) Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark document
Date Thu, 01 Dec 2016 01:39:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15710505#comment-15710505
] 

Rui Li commented on HIVE-15313:
-------------------------------

Seems these two configs are useful in several ways :) I'm also looking at them in HIVE-15302.
My plan is to identify the minimum set of needed jars (good for performance and for avoiding conflicts)
and update the wiki.
We can also make a code change to set it automatically if the user hasn't.

> Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark document
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-15313
>                 URL: https://issues.apache.org/jira/browse/HIVE-15313
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Priority: Minor
>         Attachments: performance.improvement.after.set.spark.yarn.archive.PNG
>
>
> Following the [wiki|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started],
> I ran queries on HOS16 and HOS20 in yarn mode.
> The following table shows the difference in query time between HOS16 and HOS20.
> ||Version||Total time (secs)||Time for jobs (secs)||Time for preparing jobs (secs)||
> |Spark16|51|39|12|
> |Spark20|54|40|14| 
> HOS20 spends 2 more seconds preparing jobs than HOS16. After reviewing the Spark source
> code, I found that the following causes this:
> code: [Client#distribute|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L546].
> In Spark 2.0, if Spark finds neither spark.yarn.archive nor spark.yarn.jars in its configuration
> file, it first copies all jars in $SPARK_HOME/jars to a tmp directory and uploads the tmp
> directory to the distributed cache. Compare with [spark16|https://github.com/apache/spark/blob/branch-1.6/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1145]:
> In Spark 1.6, it searches for spark-assembly*.jar and uploads it to the distributed cache.
> In Spark 2.0, it spends 2 more seconds copying all jars in $SPARK_HOME/jars to a tmp directory
> if we don't set "spark.yarn.archive" or "spark.yarn.jars".
> We can accelerate the startup of Hive on Spark 2.0 by setting "spark.yarn.archive" or
> "spark.yarn.jars":
> set "spark.yarn.archive":
> {code}
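> # -j drops directory paths so the jars sit at the root of the archive
> $ zip -j spark-archive.zip $SPARK_HOME/jars/*
> # upload to the HDFS root so the path matches the URL configured below
> $ hadoop fs -copyFromLocal spark-archive.zip /
> $ echo "spark.yarn.archive=hdfs://xxx:8020/spark-archive.zip" >> conf/spark-defaults.conf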
> {code}
> set "spark.yarn.jars":
> {code}
> # create the directory at the HDFS root so the path matches the URL configured below
> $ hadoop fs -mkdir /spark-2.0.0-bin-hadoop
> $ hadoop fs -copyFromLocal $SPARK_HOME/jars/* /spark-2.0.0-bin-hadoop
> $ echo "spark.yarn.jars=hdfs://xxx:8020/spark-2.0.0-bin-hadoop/*" >> conf/spark-defaults.conf
> {code}
> I suggest adding this part to the wiki.
> performance.improvement.after.set.spark.yarn.archive.PNG shows the detailed performance
> improvement after setting spark.yarn.archive for small queries.
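
Before applying either setting, it may help to check whether a configuration file already defines one of them. The following is only a minimal sketch: the function name has_yarn_jars_setting and the fallback path /opt/spark are hypothetical and not part of Hive or Spark. It assumes the file may use either "key=value" or "key value" lines, and it ignores commented-out entries.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hive or Spark): succeed if the given
# properties file sets spark.yarn.archive or spark.yarn.jars.
# Assumption: entries may be written as "key=value", "key:value" or "key value".
# Lines starting with '#' do not match because the key must come first.
has_yarn_jars_setting() {
    conf_file="$1"
    [ -f "$conf_file" ] || return 1
    grep -Eq '^[[:space:]]*spark\.yarn\.(archive|jars)([[:space:]]|=|:)' "$conf_file"
}

# Example usage. The fallback /opt/spark is an assumed install location.
conf="${SPARK_HOME:-/opt/spark}/conf/spark-defaults.conf"
if has_yarn_jars_setting "$conf"; then
    echo "spark.yarn.archive or spark.yarn.jars is set in $conf"
else
    echo "neither spark.yarn.archive nor spark.yarn.jars found in $conf"
fi
```

If the check reports that neither setting is present, one of the two setups described above can be applied.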



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
