hive-issues mailing list archives

From "liyunzhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
Date Thu, 02 Nov 2017 02:43:02 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235094#comment-16235094
] 

liyunzhang commented on HIVE-17486:
-----------------------------------

[~xuefuz]:

{quote}
 My gut feeling is that this needs to be combined with Spark RDD caching or Hive's materialized
view.
{quote}
 About the optimization: I found that Hive on Tez indeed gets a 20%+ improvement on TPC-DS query28, 88, and 90 when the hardware is not excellent or when the table scans read huge amounts of data. So I want to implement it in Hive on Spark.
 I agree that we need to combine Spark RDD caching with the optimization to reduce the table scans. As you described, the multi-insert case benefits from Spark RDD caching because map12=map13 (a rough Spark sketch of this caching pattern is at the end of this comment), but more complex cases do not. Take DS/query28.sql as an example.
 The physical plan (TS = TableScan, FIL = Filter, SEL = Select, GBY = GroupBy, RS = ReduceSink, JOIN = Join, LIM = Limit, FS = FileSink):
 {code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
TS[7]-FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
TS[14]-FIL[54]-SEL[16]-GBY[17]-RS[18]-GBY[19]-RS[44]-JOIN[48]
TS[21]-FIL[55]-SEL[23]-GBY[24]-RS[25]-GBY[26]-RS[45]-JOIN[48]
TS[28]-FIL[56]-SEL[30]-GBY[31]-RS[32]-GBY[33]-RS[46]-JOIN[48]
TS[35]-FIL[57]-SEL[37]-GBY[38]-RS[39]-GBY[40]-RS[47]-JOIN[48]
{code}

After the scan share optimization, the physical plan is:
{code}
TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]-GBY[5]-RS[42]-JOIN[48]-SEL[49]-LIM[50]-FS[51]
     -FIL[53]-SEL[9]-GBY[10]-RS[11]-GBY[12]-RS[43]-JOIN[48]
     -FIL[54]-SEL[16]-GBY[17]-RS[18]-GBY[19]-RS[44]-JOIN[48]
     -FIL[55]-SEL[23]-GBY[24]-RS[25]-GBY[26]-RS[45]-JOIN[48]
     -FIL[56]-SEL[30]-GBY[31]-RS[32]-GBY[33]-RS[46]-JOIN[48]
     -FIL[57]-SEL[37]-GBY[38]-RS[39]-GBY[40]-RS[47]-JOIN[48]

{code}

HoS will split the operator trees when encountering {{RS}}:
{code}
Map1: TS[0]-FIL[52]-SEL[2]-GBY[3]-RS[4]
Map2: TS[0]-FIL[53]-SEL[9]-GBY[10]-RS[11]
Map3: TS[0]-FIL[54]-SEL[16]-GBY[17]-RS[18]
Map4: TS[0]-FIL[55]-SEL[23]-GBY[24]-RS[25]
Map5: TS[0]-FIL[56]-SEL[30]-GBY[31]-RS[32]
Map6: TS[0]-FIL[57]-SEL[37]-GBY[38]-RS[39]
{code}

We cannot combine Map1, ..., Map6 because the {{FIL}} operators (FIL\[52\], FIL\[53\], ..., FIL\[57\])
are not the same.
So what I am thinking about is: can we directly extract the TS from the MapTask and put it into a single Map?
{code}
Map0: TS[0]
Map1: FIL[52]-SEL[2]-GBY[3]-RS[4]
Map2: FIL[53]-SEL[9]-GBY[10]-RS[11]
Map3: FIL[54]-SEL[16]-GBY[17]-RS[18]
Map4: FIL[55]-SEL[23]-GBY[24]-RS[25]
Map5: FIL[56]-SEL[30]-GBY[31]-RS[32]
Map6: FIL[57]-SEL[37]-GBY[38]-RS[39]
{code}
There is only TS\[0\] in Map0, and Map0 is connected to Map1, ..., Map6. Since Map0 would then be used by more than one child spark work, its result could be cached, as discussed above. I would appreciate any suggestions from you!
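
To make the idea concrete, here is a minimal sketch in plain Spark (Scala, public DataFrame API, not the HoS SparkWork internals) of "scan once in Map0, cache, and fan out to Map1, ..., Map6". The table and column names and the quantity ranges are illustrative assumptions that only roughly mirror the shape of DS/query28; they are not taken from the actual plan.
{code}
// Illustrative sketch only -- not the HoS implementation. Column names and
// ranges roughly mirror DS/query28 and are assumptions for the example.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count}
import org.apache.spark.storage.StorageLevel

object SharedScanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shared-scan-sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // "Map0": a single table scan whose result is cached and reused.
    val scanned = spark.table("store_sales")
      .select($"ss_quantity", $"ss_list_price")
      .persist(StorageLevel.MEMORY_AND_DISK)

    // "Map1"..."Map6": different FIL/GBY branches over the same cached scan.
    val quantityRanges = Seq((0, 5), (6, 10), (11, 15), (16, 20), (21, 25), (26, 30))
    val branches = quantityRanges.map { case (lo, hi) =>
      scanned.filter($"ss_quantity".between(lo, hi))
             .agg(avg($"ss_list_price"), count($"ss_list_price"))
    }

    // The JOIN[48]-like step: combine the six single-row aggregates.
    val result = branches.reduce(_ crossJoin _)
    result.show()
  }
}
{code}
In this shape the store_sales data is read and deserialized only once; each of the six branches only pays for its own filter and aggregation.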


> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>            Priority: Major
>         Attachments: scanshare.after.svg, scanshare.before.svg
>
>
> HIVE-16602 implemented shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be merged
> so the data is read only once. Optimization will be carried out at the physical level. In
> Hive on Spark, the result of a spark work is cached if it is used by more than one
> child spark work. After SharedWorkOptimizer is enabled in the physical plan in HoS, identical
> table scans are merged into one table scan. The result of that table scan is then used by more
> than one child spark work, so the same computation does not need to be repeated thanks to the cache mechanism.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
