hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "liyunzhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-16602) Implement shared scans with Tez
Date Wed, 01 Nov 2017 05:32:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-16602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16233668#comment-16233668
] 

liyunzhang commented on HIVE-16602:
-----------------------------------

[~jcamachorodriguez]:  I indeed found the performance improvement in a not very good hw env(less
hard disk and less memory). I guess this is because shared scans optimization can help reduce
duplicated table scan because table scan may take long time in a bad hw env. for example(DS/[query28.sql|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/perf/query28.q]
before the shared scan optimization, the explain
{code}
Vertex dependency in root stage
Reducer 11 <- Map 10 (SIMPLE_EDGE)
Reducer 13 <- Map 12 (SIMPLE_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 11 (CUSTOM_SIMPLE_EDGE), Reducer 13 (CUSTOM_SIMPLE_EDGE), Reducer
2 (CUSTOM_SIMPLE_EDGE), Reducer 5 (CUSTOM_SIMPLE_EDGE), Reducer 7 (CUSTOM_SIMPLE_EDGE), Reducer
9 (CUSTOM_SIMPLE_EDGE)
Reducer 5 <- Map 4 (SIMPLE_EDGE)
Reducer 7 <- Map 6 (SIMPLE_EDGE)
Reducer 9 <- Map 8 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:100
    Stage-1
      Reducer 3
      File Output Operator [FS_51]
        Limit [LIM_50] (rows=1 width=2497)
          Number of rows:100
          Select Operator [SEL_49] (rows=1 width=2497)
            Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17"]
            Merge Join Operator [MERGEJOIN_58] (rows=1 width=2497)
              Conds:(Inner),(Inner),(Inner),(Inner),(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17"]
            <-Reducer 11 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_46]
                Group By Operator [GBY_33] (rows=1 width=416)
                  Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
                <-Map 10 [SIMPLE_EDGE]
                  SHUFFLE [RS_32]
                    Group By Operator [GBY_31] (rows=21333171 width=88)
                      Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_30] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_56] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 11 AND 15 and (ss_list_price BETWEEN
66 AND 76 or ss_coupon_amt BETWEEN 920 AND 1920 or ss_wholesale_cost BETWEEN 4 AND 24))
                          TableScan [TS_28] (rows=575995635 width=88)
                            default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
            <-Reducer 13 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_47]
                Group By Operator [GBY_40] (rows=1 width=416)
                  Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
                <-Map 12 [SIMPLE_EDGE]
                  SHUFFLE [RS_39]
                    Group By Operator [GBY_38] (rows=21333171 width=88)
                      Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_37] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_57] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 6 AND 10 and (ss_list_price BETWEEN
91 AND 101 or ss_coupon_amt BETWEEN 1430 AND 2430 or ss_wholesale_cost BETWEEN 32 AND 52))
                          TableScan [TS_35] (rows=575995635 width=88)
                            default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
            <-Reducer 2 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_42]
                Group By Operator [GBY_5] (rows=1 width=416)
                  Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_4]
                    Group By Operator [GBY_3] (rows=21333171 width=88)
                      Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_2] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_52] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 0 AND 5 and (ss_list_price BETWEEN
11 AND 21 or ss_coupon_amt BETWEEN 460 AND 1460 or ss_wholesale_cost BETWEEN 14 AND 34))
                          TableScan [TS_0] (rows=575995635 width=88)
                            default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
            <-Reducer 5 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_43]
                Group By Operator [GBY_12] (rows=1 width=416)
                  Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
                <-Map 4 [SIMPLE_EDGE]
                  SHUFFLE [RS_11]
                    Group By Operator [GBY_10] (rows=21333171 width=88)
                      Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_9] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_53] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 26 AND 30 and (ss_list_price BETWEEN
28 AND 38 or ss_coupon_amt BETWEEN 2513 AND 3513 or ss_wholesale_cost BETWEEN 42 AND 62))
                          TableScan [TS_7] (rows=575995635 width=88)
                            default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
            <-Reducer 7 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_44]
                Group By Operator [GBY_19] (rows=1 width=416)
                  Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
                <-Map 6 [SIMPLE_EDGE]
                  SHUFFLE [RS_18]
                    Group By Operator [GBY_17] (rows=21333171 width=88)
                      Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_16] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_54] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 21 AND 25 and (ss_list_price BETWEEN
135 AND 145 or ss_coupon_amt BETWEEN 14180 AND 15180 or ss_wholesale_cost BETWEEN 38 AND 58))
                          TableScan [TS_14] (rows=575995635 width=88)
                            default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
            <-Reducer 9 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_45]
                Group By Operator [GBY_26] (rows=1 width=416)
                  Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
                <-Map 8 [SIMPLE_EDGE]
                  SHUFFLE [RS_25]
                    Group By Operator [GBY_24] (rows=21333171 width=88)
                      Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_23] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_55] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 16 AND 20 and (ss_list_price BETWEEN
142 AND 152 or ss_coupon_amt BETWEEN 3054 AND 4054 or ss_wholesale_cost BETWEEN 80 AND 100))
                          TableScan [TS_21] (rows=575995635 width=88)
                            default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
{code}

After the shared scan optimization, explain
{code}
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE), Reducer 4 (CUSTOM_SIMPLE_EDGE), Reducer 5
(CUSTOM_SIMPLE_EDGE), Reducer 6 (CUSTOM_SIMPLE_EDGE), Reducer 7 (CUSTOM_SIMPLE_EDGE), Reducer
8 (CUSTOM_SIMPLE_EDGE)
Reducer 4 <- Map 1 (SIMPLE_EDGE)
Reducer 5 <- Map 1 (SIMPLE_EDGE)
Reducer 6 <- Map 1 (SIMPLE_EDGE)
Reducer 7 <- Map 1 (SIMPLE_EDGE)
Reducer 8 <- Map 1 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:100
    Stage-1
      Reducer 3
      File Output Operator [FS_51]
        Limit [LIM_50] (rows=1 width=2497)
          Number of rows:100
          Select Operator [SEL_49] (rows=1 width=2497)
            Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17"]
            Merge Join Operator [MERGEJOIN_58] (rows=1 width=2497)
              Conds:(Inner),(Inner),(Inner),(Inner),(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14","_col15","_col16","_col17"]
            <-Reducer 2 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_42]
                Group By Operator [GBY_5] (rows=1 width=416)
                  Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_4]
                    Group By Operator [GBY_3] (rows=21333171 width=88)
                      Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_2] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_52] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 0 AND 5 and (ss_list_price BETWEEN
11 AND 21 or ss_coupon_amt BETWEEN 460 AND 1460 or ss_wholesale_cost BETWEEN 14 AND 34))
                          TableScan [TS_0] (rows=575995635 width=88)
                            default@store_sales,store_sales,Tbl:COMPLETE,Col:NONE,Output:["ss_quantity","ss_wholesale_cost","ss_list_price","ss_coupon_amt"]
            <-Reducer 4 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_43]
                Group By Operator [GBY_12] (rows=1 width=416)
                  Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_11]
                    Group By Operator [GBY_10] (rows=21333171 width=88)
                      Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_9] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_53] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 26 AND 30 and (ss_list_price BETWEEN
28 AND 38 or ss_coupon_amt BETWEEN 2513 AND 3513 or ss_wholesale_cost BETWEEN 42 AND 62))
                           Please refer to the previous TableScan [TS_0]
            <-Reducer 5 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_44]
                Group By Operator [GBY_19] (rows=1 width=416)
                  Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_18]
                    Group By Operator [GBY_17] (rows=21333171 width=88)
                      Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_16] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_54] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 21 AND 25 and (ss_list_price BETWEEN
135 AND 145 or ss_coupon_amt BETWEEN 14180 AND 15180 or ss_wholesale_cost BETWEEN 38 AND 58))
                           Please refer to the previous TableScan [TS_0]
            <-Reducer 6 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_45]
                Group By Operator [GBY_26] (rows=1 width=416)
                  Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_25]
                    Group By Operator [GBY_24] (rows=21333171 width=88)
                      Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_23] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_55] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 16 AND 20 and (ss_list_price BETWEEN
142 AND 152 or ss_coupon_amt BETWEEN 3054 AND 4054 or ss_wholesale_cost BETWEEN 80 AND 100))
                           Please refer to the previous TableScan [TS_0]
            <-Reducer 7 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_46]
                Group By Operator [GBY_33] (rows=1 width=416)
                  Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_32]
                    Group By Operator [GBY_31] (rows=21333171 width=88)
                      Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_30] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_56] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 11 AND 15 and (ss_list_price BETWEEN
66 AND 76 or ss_coupon_amt BETWEEN 920 AND 1920 or ss_wholesale_cost BETWEEN 4 AND 24))
                           Please refer to the previous TableScan [TS_0]
            <-Reducer 8 [CUSTOM_SIMPLE_EDGE]
              PARTITION_ONLY_SHUFFLE [RS_47]
                Group By Operator [GBY_40] (rows=1 width=416)
                  Output:["_col0","_col1","_col2"],aggregations:["avg(VALUE._col0)","count(VALUE._col1)","count(DISTINCT
KEY._col0:0._col0)"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_39]
                    Group By Operator [GBY_38] (rows=21333171 width=88)
                      Output:["_col0","_col1","_col2","_col3"],aggregations:["avg(ss_list_price)","count(ss_list_price)","count(DISTINCT
ss_list_price)"],keys:ss_list_price
                      Select Operator [SEL_37] (rows=21333171 width=88)
                        Output:["ss_list_price"]
                        Filter Operator [FIL_57] (rows=21333171 width=88)
                          predicate:(ss_quantity BETWEEN 6 AND 10 and (ss_list_price BETWEEN
91 AND 101 or ss_coupon_amt BETWEEN 1430 AND 2430 or ss_wholesale_cost BETWEEN 32 AND 52))
                           Please refer to the previous TableScan [TS_0]

{code}

Before there are 6 Maps, After there are only 1 Map. In After explain, there are  6 edges
from Map1 to Reducer 2, Reducer 4, Reducer 5, Reducer 6, Reducer 7 and Reducer 8. Here i want
to ask question whether  these edges can be shared so the table scan can be reused or not?

> Implement shared scans with Tez
> -------------------------------
>
>                 Key: HIVE-16602
>                 URL: https://issues.apache.org/jira/browse/HIVE-16602
>             Project: Hive
>          Issue Type: New Feature
>          Components: Physical Optimizer
>    Affects Versions: 3.0.0
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: HIVE-16602.01.patch, HIVE-16602.02.patch, HIVE-16602.03.patch, HIVE-16602.04.patch,
HIVE-16602.patch
>
>
> Given a query plan, the goal is to identify scans on input tables that can be merged
so the data is read only once. Optimization will be carried out at the physical level.
> In the longer term, identification of equivalent expressions and reutilization of intermediary
results should be done at the logical layer via Spool operator.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message