hive-issues mailing list archives

From "liyunzhang_intel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]
Date Fri, 23 Jun 2017 02:35:01 GMT

    [ https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060336#comment-16060336 ]

liyunzhang_intel commented on HIVE-11297:
-----------------------------------------

[~csun]: regarding the questions you raised in RB, the two queries are different.
Explain for query1 (please use the attached hive-site.xml to verify; without the configuration
in hive-site.xml, I cannot reproduce the following explain):
{code}
set hive.execution.engine=spark; 
set hive.spark.dynamic.partition.pruning=true;
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.strict.checks.cartesian.product=false;
explain select count(*) from srcpart join srcpart_date on (srcpart.ds = srcpart_date.ds) join
srcpart_hour on (srcpart.hr = srcpart_hour.hr) 
where srcpart_date.`date` = '2008-04-08' and srcpart_hour.hour = 11 and srcpart.hr = 11
{code}
previous explain 
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:2
      Vertices:
        Map 7 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date
                  filterExpr: ((date = '2008-04-08') and ds is not null) (type: boolean)
                  Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE Column stats:
NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and ds is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats:
NONE
                    Select Operator
                      expressions: ds (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats:
NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column
stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column
stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: ds
                            Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column
stats: NONE
                            target column name: ds
                            target work: Map 1

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 5 (PARTITION-LEVEL SORT, 2)
        Reducer 3 <- Map 6 (PARTITION-LEVEL SORT, 2), Reducer 2 (PARTITION-LEVEL SORT,
2)
        Reducer 4 <- Reducer 3 (GROUP, 1)
      DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: srcpart
                  Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats:
NONE
                  Select Operator
                    expressions: ds (type: string), hr (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats:
NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string)
                      sort order: +
                      Map-reduce partition columns: _col0 (type: string)
                      Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column
stats: NONE
                      value expressions: _col1 (type: string)
        Map 5 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date
                  filterExpr: ((date = '2008-04-08') and ds is not null) (type: boolean)
                  Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE Column stats:
NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and ds is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats:
NONE
                    Select Operator
                      expressions: ds (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats:
NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column
stats: NONE
        Map 6 
            Map Operator Tree:
                TableScan
                  alias: srcpart_hour
                  filterExpr: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr) = 11.0)) (type:
boolean)
                  Statistics: Num rows: 2 Data size: 10 Basic stats: COMPLETE Column stats:
NONE
                  Filter Operator
                    predicate: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr) = 11.0)) (type:
boolean)
                    Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats:
NONE
                    Select Operator
                      expressions: hr (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats:
NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column
stats: NONE
        Reducer 2 
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col0 (type: string)
                  1 _col0 (type: string)
                outputColumnNames: _col1
                Statistics: Num rows: 1 Data size: 12786 Basic stats: COMPLETE Column stats:
NONE
                Reduce Output Operator
                  key expressions: _col1 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col1 (type: string)
                  Statistics: Num rows: 1 Data size: 12786 Basic stats: COMPLETE Column stats:
NONE
        Reducer 3 
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col1 (type: string)
                  1 _col0 (type: string)
                Statistics: Num rows: 1 Data size: 14064 Basic stats: COMPLETE Column stats:
NONE
                Group By Operator
                  aggregations: count()
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats:
NONE
                  Reduce Output Operator
                    sort order: 
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats:
NONE
                    value expressions: _col0 (type: bigint)
        Reducer 4 
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats:
NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

{code}

current explain
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:2
      Vertices:
        Map 7 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date
                  filterExpr: ((date = '2008-04-08') and ds is not null) (type: boolean)
                  Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE Column stats:
NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and ds is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats:
NONE
                    Select Operator
                      expressions: ds (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats:
NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column
stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column
stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: ds
                            Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column
stats: NONE
                            target column name: ds
                            target work: Map 1
        Map 8 
            Map Operator Tree:
                TableScan
                  alias: srcpart_hour
                  filterExpr: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr) = 11.0)) (type:
boolean)
                  Statistics: Num rows: 2 Data size: 10 Basic stats: COMPLETE Column stats:
NONE
                  Filter Operator
                    predicate: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr) = 11.0)) (type:
boolean)
                    Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats:
NONE
                    Select Operator
                      expressions: hr (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats:
NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column
stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column
stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: hr
                            Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column
stats: NONE
                            target column name: hr
                            target work: Map 1


  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 5 (PARTITION-LEVEL SORT, 2)
        Reducer 3 <- Map 6 (PARTITION-LEVEL SORT, 2), Reducer 2 (PARTITION-LEVEL SORT,
2)
        Reducer 4 <- Reducer 3 (GROUP, 1)
      DagName: root_20170622213734_eb4c35e8-952a-4c4d-8972-ba5381bf51a3:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: srcpart
                  Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats:
NONE
                  Select Operator
                    expressions: ds (type: string), hr (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column stats:
NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string)
                      sort order: +
                      Map-reduce partition columns: _col0 (type: string)
                      Statistics: Num rows: 1 Data size: 11624 Basic stats: PARTIAL Column
stats: NONE
                      value expressions: _col1 (type: string)
        Map 5 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date
                  filterExpr: ((date = '2008-04-08') and ds is not null) (type: boolean)
                  Statistics: Num rows: 2 Data size: 42 Basic stats: COMPLETE Column stats:
NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and ds is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats:
NONE
                    Select Operator
                      expressions: ds (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column stats:
NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 1 Data size: 21 Basic stats: COMPLETE Column
stats: NONE
        Map 6 
            Map Operator Tree:
                TableScan
                  alias: srcpart_hour
                  filterExpr: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr) = 11.0)) (type:
boolean)
                  Statistics: Num rows: 2 Data size: 10 Basic stats: COMPLETE Column stats:
NONE
                  Filter Operator
                    predicate: ((UDFToDouble(hour) = 11.0) and (UDFToDouble(hr) = 11.0)) (type:
boolean)
                    Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats:
NONE
                    Select Operator
                      expressions: hr (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats:
NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column
stats: NONE
        Reducer 2 
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col0 (type: string)
                  1 _col0 (type: string)
                outputColumnNames: _col1
                Statistics: Num rows: 1 Data size: 12786 Basic stats: COMPLETE Column stats:
NONE
                Reduce Output Operator
                  key expressions: _col1 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col1 (type: string)
                  Statistics: Num rows: 1 Data size: 12786 Basic stats: COMPLETE Column stats:
NONE
        Reducer 3 
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col1 (type: string)
                  1 _col0 (type: string)
                Statistics: Num rows: 1 Data size: 14064 Basic stats: COMPLETE Column stats:
NONE
                Group By Operator
                  aggregations: count()
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats:
NONE
                  Reduce Output Operator
                    sort order: 
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats:
NONE
                    value expressions: _col0 (type: bigint)
        Reducer 4 
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats:
NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

The difference between the previous and the current explain is the extra Map 8. After querying the history of spark_dynamic_partition_pruning.q.out,
I found that Map 8 exists in {{commit 42216997f7fcff1853524b03e3961ec5c21f3fd7}} and is removed after {{commit 677e5d20109e31203129ef5090c8989e9bb7c366}}.
Map 8 was removed in {{commit 677e5d20109e31203129ef5090c8989e9bb7c366}} because the
[filterDesc|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/DynamicPartitionPruningOptimization.java#L128] changed:
in {{commit 677e5d20109e31203129ef5090c8989e9bb7c366}} the filterDesc is {noformat}((ds) IN (RS[10]) and true) (type: boolean){noformat}
while in the latest version it is {noformat}((ds) IN (RS[5]) and (hr) IN (RS[9])) (type: boolean){noformat}.
That is why there is no Map 8 in {{commit 677e5d20109e31203129ef5090c8989e9bb7c366}}. In fact, Map 8 should exist, because
we should build a DPP operator tree for both {{srcpart_date}} and {{srcpart_hour}}, since {{srcpart}}
has two partition columns (srcpart.ds and srcpart.hr). However, I will not investigate further why Map 8
is lost in {{commit 677e5d20109e31203129ef5090c8989e9bb7c366}}.
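
To make that concrete, here is a minimal, self-contained sketch (plain Java with a hypothetical {{pruningTargets}} helper; it is not Hive's actual DynamicPartitionPruningOptimization code): every synthetic {{(col) IN (RS[n])}} clause in the combined filter on the large table should produce its own pruning branch (Select -> Group By -> Spark Partition Pruning Sink), so a two-clause filterDesc implies both Map 7 (for ds) and Map 8 (for hr).
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only, not Hive's real optimizer API: extract one
// dynamic-partition-pruning target per "(col) IN (RS[n])" clause of the
// combined filterDesc string.
public class DppPredicateSketch {

  // Matches synthetic DPP predicates such as "(ds) IN (RS[5])".
  private static final Pattern DPP_CLAUSE =
      Pattern.compile("\\((\\w+)\\)\\s+IN\\s+\\(RS\\[(\\d+)\\]\\)");

  static List<String> pruningTargets(String filterDesc) {
    List<String> targets = new ArrayList<>();
    Matcher m = DPP_CLAUSE.matcher(filterDesc);
    while (m.find()) {
      // Each matched clause should get its own pruning branch
      // (Select -> Group By -> Spark Partition Pruning Sink).
      targets.add(m.group(1));
    }
    return targets;
  }

  public static void main(String[] args) {
    // filterDesc as of commit 677e5d2: only ds is pruned, so Map 8 is lost.
    System.out.println(pruningTargets("((ds) IN (RS[10]) and true)"));            // [ds]
    // filterDesc in the latest version: ds and hr are pruned -> Map 7 and Map 8.
    System.out.println(pruningTargets("((ds) IN (RS[5]) and (hr) IN (RS[9]))"));  // [ds, hr]
  }
}
{code}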

> Combine op trees for partition info generating tasks [Spark branch]
> -------------------------------------------------------------------
>
>                 Key: HIVE-11297
>                 URL: https://issues.apache.org/jira/browse/HIVE-11297
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: spark-branch
>            Reporter: Chao Sun
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-11297.1.patch, HIVE-11297.2.patch, HIVE-11297.3.patch, HIVE-11297.4.patch,
HIVE-11297.5.patch, HIVE-11297.6.patch, HIVE-11297.7.patch
>
>
> Currently, for dynamic partition pruning in Spark, if a small table generates partition
> info for more than one partition column, multiple operator trees are created, which all start
> from the same table scan op but have different Spark partition pruning sinks.
> As an optimization, we can combine these op trees so we don't have to scan the table
> multiple times.
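
To illustrate the combining idea, a sketch only (hypothetical {{Tree}} and {{Sink}} records, not Hive's operator classes; assumes Java 16+ for records): before the optimization, each pruned partition column gets its own operator tree and therefore its own scan of the small table; after combining, one scan fans out into one Spark Partition Pruning Sink per column.
{code}
import java.util.List;

// Sketch only: model "before" as one tree (and one small-table scan) per
// pruned partition column, and "after" as a single tree whose one scan
// feeds several pruning sinks.
public class CombineDppTreesSketch {

  record Sink(String partitionColumn, String targetWork) {}
  record Tree(String scannedTable, List<Sink> sinks) {}

  public static void main(String[] args) {
    // Before: two trees, two scans of the same small table.
    List<Tree> before = List.of(
        new Tree("small_table", List.of(new Sink("ds", "Map 1"))),
        new Tree("small_table", List.of(new Sink("hr", "Map 1"))));

    // After: one tree, one scan, two pruning sinks.
    Tree after = new Tree("small_table",
        List.of(new Sink("ds", "Map 1"), new Sink("hr", "Map 1")));

    System.out.println("scans before = " + before.size());                    // 2
    System.out.println("scans after  = 1, sinks = " + after.sinks().size());  // 2
  }
}
{code}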


