[ https://issues.apache.org/jira/browse/HIVE-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15754322#comment-15754322 ]
Jesus Camacho Rodriguez commented on HIVE-15122:
------------------------------------------------
[~ashutoshc], could you review this patch?
For the new test case, the PK-FK inference can be verified in the logs.
For that particular case, the stats without the patch are:
{code}
Statistics: Num rows: 889 Data size: 7112 Basic stats: COMPLETE Column stats: COMPLETE
{code}
And the stats with the patch are:
{code}
Statistics: Num rows: 964 Data size: 7712 Basic stats: COMPLETE Column stats: COMPLETE
{code}
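To compare the two estimates without eyeballing the logs, a small helper can pull the numbers out of the `Statistics:` lines. This is only a sketch: the function name `parse_stats` and the regular expression are illustrative assumptions, not part of Hive or of the patch.

```python
import re

# Assumed log format: matches the "Statistics:" lines quoted above.
STATS_RE = re.compile(r"Statistics: Num rows: (\d+) Data size: (\d+)")


def parse_stats(line):
    """Return (num_rows, data_size) parsed from one Statistics line."""
    match = STATS_RE.search(line)
    if match is None:
        raise ValueError("not a Statistics line: %r" % line)
    return int(match.group(1)), int(match.group(2))


without_patch = ("Statistics: Num rows: 889 Data size: 7112 "
                 "Basic stats: COMPLETE Column stats: COMPLETE")
with_patch = ("Statistics: Num rows: 964 Data size: 7712 "
              "Basic stats: COMPLETE Column stats: COMPLETE")

rows_before, size_before = parse_stats(without_patch)
rows_after, size_after = parse_stats(with_patch)

# Difference in estimated rows between the two plans.
print(rows_after - rows_before)  # 75
# Bytes per row is the same in both estimates.
print(size_before // rows_before, size_after // rows_after)  # 8 8
```

Both estimates use 8 bytes per row, so the patch changes only the estimated row count for this operator.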
> Hive: Upcasting types should not obscure stats (min/max/ndv)
> ------------------------------------------------------------
>
> Key: HIVE-15122
> URL: https://issues.apache.org/jira/browse/HIVE-15122
> Project: Hive
> Issue Type: Bug
> Reporter: Siddharth Seth
> Assignee: Jesus Camacho Rodriguez
> Attachments: HIVE-15122.patch
>
>
> A UDFToLong cast on a join key breaks PK/FK inference and triggers mis-estimation of joins in LLAP.
> Snippet from the bad plan:
> {code}
> | STAGE PLANS: |
> | Stage: Stage-1 |
> | Tez |
> | DagId: hive_20161031222730_a700058f-78eb-40d6-a67d-43add60a50e2:6 |
> | Edges: |
> | Map 2 <- Map 1 (BROADCAST_EDGE) |
> | Map 3 <- Map 2 (BROADCAST_EDGE) |
> | Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE), Map 7 (CUSTOM_SIMPLE_EDGE), Map 8 (BROADCAST_EDGE), Map 9 (BROADCAST_EDGE) |
> | Reducer 5 <- Reducer 4 (SIMPLE_EDGE) |
> | Reducer 6 <- Reducer 5 (SIMPLE_EDGE) |
> | DagName: |
> | Vertices: |
> | Map 1 |
> | Map Operator Tree: |
> | TableScan |
> | alias: supplier |
> | filterExpr: (s_suppkey is not null and s_nationkey is not null) (type: boolean) |
> | Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE |
> | Filter Operator |
> | predicate: (s_suppkey is not null and s_nationkey is not null) (type: boolean) |
> | Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE |
> | Select Operator |
> | expressions: s_suppkey (type: bigint), s_nationkey (type: bigint) |
> | outputColumnNames: _col0, _col1 |
> | Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE |
> | Reduce Output Operator |
> | key expressions: _col0 (type: bigint) |
> | sort order: + |
> | Map-reduce partition columns: _col0 (type: bigint) |
> | Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE |
> | value expressions: _col1 (type: bigint) |
> | Execution mode: vectorized, llap |
> | LLAP IO: all inputs |
> | Map 2 |
> | Map Operator Tree: |
> | TableScan |
> | alias: lineitem |
> | filterExpr: (l_suppkey is not null and l_orderkey is not null) (type: boolean) |
> | Statistics: Num rows: 2285121364 Data size: 63983407882 Basic stats: COMPLETE Column stats: PARTIAL |
> | Filter Operator |
> | predicate: (l_suppkey is not null and l_orderkey is not null) (type: boolean) |
> | Statistics: Num rows: 2285121364 Data size: 127966796384 Basic stats: COMPLETE Column stats: PARTIAL |
> | Select Operator |
> | expressions: l_orderkey (type: bigint), l_suppkey (type: int), l_extendedprice (type: double), l_discount (type: double), l_shipdate (type: date) |
> | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
> | Statistics: Num rows: 2285121364 Data size: 127966796384 Basic stats: COMPLETE Column stats: PARTIAL |
> | Map Join Operator |
> | condition map: |
> | Inner Join 0 to 1 |
> | keys: |
> | 0 _col0 (type: bigint) |
> | 1 UDFToLong(_col1) (type: bigint) |
> | outputColumnNames: _col1, _col2, _col4, _col5, _col6 |
> | input vertices: |
> | 0 Map 1 |
> | Statistics: Num rows: 10000000 Data size: 880000000 Basic stats: COMPLETE Column stats: PARTIAL |
> | Reduce Output Operator |
> | key expressions: _col2 (type: bigint) |
> | sort order: + |
> | Map-reduce partition columns: _col2 (type: bigint) |
> | Statistics: Num rows: 10000000 Data size: 880000000 Basic stats: COMPLETE Column stats: PARTIAL |
> | value expressions: _col1 (type: bigint), _col4 (type: double), _col5 (type: double), _col6 (type: date) |
> | Execution mode: vectorized, llap |
> | LLAP IO: all inputs |
> | Map 3 |
> | Map Operator Tree: |
> | TableScan |
> | alias: orders |
> | filterExpr: (o_orderkey is not null and o_custkey is not null) (type: boolean) |
> | Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE |
> | Filter Operator |
> | predicate: (o_orderkey is not null and o_custkey is not null) (type: boolean) |
> | Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE |
> | Select Operator |
> | expressions: o_orderkey (type: int), o_custkey (type: bigint) |
> | outputColumnNames: _col0, _col1 |
> | Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE |
> | Map Join Operator |
> | condition map: |
> | Inner Join 0 to 1 |
> | keys: |
> | 0 _col2 (type: bigint) |
> | 1 UDFToLong(_col0) (type: bigint) |
> | outputColumnNames: _col1, _col4, _col5, _col6, _col8 |
> | input vertices: |
> | 0 Map 2 |
> | Statistics: Num rows: 4750681341 Data size: 57008190663 Basic stats: COMPLETE Column stats: NONE |
> | Reduce Output Operator |
> | key expressions: _col8 (type: bigint) |
> | sort order: + |
> | Map-reduce partition columns: _col8 (type: bigint) |
> | Statistics: Num rows: 4750681341 Data size: 57008190663 Basic stats: COMPLETE Column stats: NONE |
> | value expressions: _col1 (type: bigint), _col4 (type: double), _col5 (type: double), _col6 (type: date) |
> | Execution mode: vectorized, llap |
> | LLAP IO: all inputs |
> | Map 7
> {code}
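> Note the Map 2 to Map 3 output: the Map Join in Map 2 is estimated at only 10000000 rows (880000000 bytes), although its lineitem input has 2285121364 rows.
> This causes a rather large join (120GB) to be categorized as a map-join.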
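To make the size of the mis-estimate concrete, here is a rough sketch using the textbook equi-join estimate rows_left * rows_right / max(ndv_left, ndv_right). It is not Hive's actual estimation code, and the NDV value for the join key is an assumption (one distinct value per supplier row); only the row counts are taken from the plan above.

```python
def estimate_join_rows(rows_left, ndv_left, rows_right, ndv_right):
    """Textbook equi-join estimate; a simplification, not Hive's rule."""
    return rows_left * rows_right // max(ndv_left, ndv_right)


supplier_rows = 10_000_000      # from the Map 1 TableScan
lineitem_rows = 2_285_121_364   # from the Map 2 TableScan
assumed_key_ndv = 10_000_000    # assumption: s_suppkey is unique per supplier

# With usable column stats on both join keys, the estimate stays
# close to the size of the foreign-key side (lineitem).
expected = estimate_join_rows(
    supplier_rows, assumed_key_ndv, lineitem_rows, assumed_key_ndv
)

reported = 10_000_000           # Map Join estimate shown in the plan

print(expected)                 # 2285121364
print(expected // reported)     # 228
```

Under these assumptions the plan's estimate is more than two hundred times smaller than a stats-aware estimate, which is consistent with the join being wrongly picked as a map-join.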