hive-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-10550) Dynamic RDD caching optimization for HoS.[Spark Branch]
Date Wed, 13 May 2015 09:34:59 GMT

    [ https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541655#comment-14541655 ]

ASF GitHub Bot commented on HIVE-10550:
---------------------------------------

GitHub user ChengXiangLi opened a pull request:

    https://github.com/apache/hive/pull/36

    HIVE-10550 Dynamic RDD caching optimization for Hive on Spark.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ChengXiangLi/hive dynamicrddcaching

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hive/pull/36.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #36
    
----
commit 8c7668613939cdde44873a8aebe14493acfc6357
Author: chengxiang li <chengxiang.li@intel.com>
Date:   2015-05-13T09:37:59Z

    HIVE-10550 Dynamic RDD caching optimization for Hive on Spark.

----


> Dynamic RDD caching optimization for HoS.[Spark Branch]
> -------------------------------------------------------
>
>                 Key: HIVE-10550
>                 URL: https://issues.apache.org/jira/browse/HIVE-10550
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Chengxiang Li
>            Assignee: Chengxiang Li
>         Attachments: HIVE-10550.1.patch
>
>
> A Hive query may scan the same table multiple times, as in a self-join, a self-union,
> or queries that share the same subquery; [TPC-DS Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql]
> is an example. Spark supports caching RDD data: it keeps the computed RDD in memory and
> serves it from memory on subsequent accesses, which avoids recomputing that RDD (and all
> of its dependencies) at the cost of higher memory usage. By analyzing the query context,
> we should be able to identify which parts of the query can be shared, so that the
> generated Spark job can reuse the cached RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
