hive-issues mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-20252) Semijoin Reduction : Cycles due to semi join branch may remain undetected if small table side has a map join upstream.
Date Wed, 01 Aug 2018 20:40:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-20252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-20252:
---------------------------
    Description: 
For example,

 {noformat}
 # 2018-07-26T17:22:14,664 DEBUG [51377701-dc98-424f-82e0-bbb5d6c84316 main] optimizer.SharedWorkOptimizer: Before SharedWorkOptimizer:
 # TS[0]-FIL[96]-SEL[2]-MAPJOIN[156]-MAPJOIN[157]-MAPJOIN[161]-MAPJOIN[162]-FIL[47]-SEL[48]-MAPJOIN[163]-FIL[66]-SEL[67]-TNK[105]-GBY[68]-RS[69]-GBY[70]-SEL[71]-RS[72]-SEL[73]-LIM[74]-FS[75]
 #                                                          -SEL[142]-GBY[143]-RS[144]-GBY[145]-RS[155]
 # TS[3]-FIL[97]-SEL[5]-RS[34]-MAPJOIN[156]
 # TS[6]-FIL[98]-SEL[8]-RS[37]-MAPJOIN[157]
 # TS[9]-FIL[99]-SEL[11]-MAPJOIN[158]-GBY[40]-RS[42]-MAPJOIN[161]
 # TS[12]-FIL[100]-SEL[14]-RS[16]-MAPJOIN[158]
 #                       -SEL[131]-GBY[132]-EVENT[133]
 # TS[19]-FIL[101]-SEL[21]-MAPJOIN[159]-GBY[29]-RS[30]-GBY[31]-SEL[32]-RS[45]-MAPJOIN[162]
 # TS[22]-FIL[102]-SEL[24]-RS[26]-MAPJOIN[159]
 #                       -SEL[139]-GBY[140]-EVENT[141]
 # TS[49]-FIL[103]-SEL[51]-MAPJOIN[160]-GBY[59]-RS[60]-GBY[61]-SEL[62]-RS[64]-MAPJOIN[163]
 # TS[52]-FIL[104]-SEL[54]-RS[56]-MAPJOIN[160]
 #                       -SEL[147]-GBY[148]-EVENT[149]
 # 
 # 
 # DPP information stored in the cache: \{TS[19]=[EVENT[141]], TS[9]=[EVENT[133]], TS[49]=[RS[155], EVENT[149]]}
{noformat}
 

The semijoin branch on line 3 (ending in RS[155]) feeds TS[49] on line 12, which in turn feeds MAPJOIN[163], looping back into the parent of the semijoin branch on line 2.

The cycle detection logic may miss this because MAPJOIN[160] on line 12 has more than one table scan upstream, so the search can settle on the wrong TS. The search for upstream TS operators must use findOperatorsUpstream() and examine every TS operator it returns for complete coverage.
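
To make the failure mode concrete, here is a minimal toy model in Python, not Hive's Java code. It encodes only the TS[49]/TS[52] part of the plan above as parent edges. The names PARENTS, DPP_CACHE, single_parent_ts, all_upstream_ts and semijoin_targets_upstream are hypothetical, and the parent ordering at MAPJOIN[160] is an arbitrary assumption. It contrasts a walk that follows one parent per operator with a full upstream traversal in the spirit of findOperatorsUpstream().

```python
from collections import deque

# Parent edges (child -> parents), transcribed from lines 12-13 of the plan.
# The order of the two parents of MAPJOIN[160] is an assumption of this sketch.
PARENTS = {
    "FIL[103]": ["TS[49]"],
    "SEL[51]": ["FIL[103]"],
    "FIL[104]": ["TS[52]"],
    "SEL[54]": ["FIL[104]"],
    "RS[56]": ["SEL[54]"],
    "MAPJOIN[160]": ["SEL[51]", "RS[56]"],
    "GBY[59]": ["MAPJOIN[160]"],
    "RS[60]": ["GBY[59]"],
    "GBY[61]": ["RS[60]"],
    "SEL[62]": ["GBY[61]"],
    "RS[64]": ["SEL[62]"],
}

# DPP cache as printed in the log: table scan -> operators feeding its filter.
DPP_CACHE = {
    "TS[19]": ["EVENT[141]"],
    "TS[9]": ["EVENT[133]"],
    "TS[49]": ["RS[155]", "EVENT[149]"],
}


def single_parent_ts(start, pick):
    """Follow exactly one parent per operator (index `pick`) until a root."""
    op = start
    while PARENTS.get(op):
        op = PARENTS[op][pick]
    return op


def all_upstream_ts(start):
    """Breadth-first walk over every parent; collect all TS operators."""
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        op = queue.popleft()
        if op.startswith("TS["):
            found.add(op)
        for parent in PARENTS.get(op, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found


def semijoin_targets_upstream(start, semijoin_rs):
    """Upstream table scans whose DPP cache entry lists the semijoin RS."""
    return sorted(
        ts for ts in all_upstream_ts(start)
        if semijoin_rs in DPP_CACHE.get(ts, [])
    )


if __name__ == "__main__":
    # Starting from RS[64], the small-table input of MAPJOIN[163].
    print(single_parent_ts("RS[64]", -1))
    print(sorted(all_upstream_ts("RS[64]")))
    print(semijoin_targets_upstream("RS[64]", "RS[155]"))
```

In this toy graph, a single-parent walk that takes the other branch at MAPJOIN[160] ends at TS[52] and never sees that TS[49] is the target of RS[155]. The full traversal returns both table scans, so the check against the DPP cache finds TS[49].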

 Simplified image of task-cycle, without operator cycles - http://people.apache.org/~gopalv/HIVE_20252_cycle1.svg

And the artificial edge introduced to trigger cycle detection (in red) - http://people.apache.org/~gopalv/HIVE_20252_cycle_fix.svg

cc [~jcamachorodriguez]


> Semijoin Reduction : Cycles due to semi join branch may remain undetected if small table side has a map join upstream.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-20252
>                 URL: https://issues.apache.org/jira/browse/HIVE-20252
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Deepak Jaiswal
>            Assignee: Deepak Jaiswal
>            Priority: Major
>         Attachments: HIVE-20252.1.patch, HIVE-20252.2.patch, HIVE-20252.3.patch, HIVE-20252.4.patch, HIVE-20252.5.patch, HIVE-20252.6.patch



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
