hive-issues mailing list archives

From "liyunzhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-17486) Enable SharedWorkOptimizer in tez on HOS
Date Tue, 05 Dec 2017 07:59:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278167#comment-16278167 ]

liyunzhang commented on HIVE-17486:
-----------------------------------

Recording the problems I have run into so far:
1. I want to change M->R to M->M->R and split the operator tree when a TS is encountered.
I created [SparkRuleDispatcher|https://github.com/kellyzly/hive/blob/HIVE-17486.3/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkRuleDispatcher.java]
to apply rules to the operator tree. I do not use DefaultRuleDispatcher because
there is already a rule called [Handle Analyze Command|https://github.com/kellyzly/hive/blob/jdk9-trial/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java#L432]
that splits operator trees when a TS is encountered. The original [SparkCompiler#opRules|https://github.com/kellyzly/hive/blob/jdk9-trial/ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java#L417]
is a LinkedHashMap, which stores one value per key, so it cannot handle the case where
one key has two values. My current solution is to change SparkCompiler#opRules to a Multimap
and create SparkRuleDispatcher. But I am afraid that when a TS is encountered,
{{SparkRuleDispatcher#dispatch}} will still apply only 1 rule.
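
To illustrate the one-key-with-two-values case described above, here is a minimal sketch. It uses only {{java.util}} (a map of lists) instead of a real Multimap, and the rule key {{TS%}} and the processor names are hypothetical strings, not the actual Hive rules or processors.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OpRulesSketch {
  // Hypothetical stand-in for opRules: rule key -> list of processor names.
  // LinkedHashMap keeps the rules in insertion order.
  static Map<String, List<String>> buildRules() {
    Map<String, List<String>> opRules = new LinkedHashMap<>();
    // Two processors registered under the same (hypothetical) TS rule key.
    opRules.computeIfAbsent("TS%", k -> new ArrayList<>()).add("AnalyzeProcessor");
    opRules.computeIfAbsent("TS%", k -> new ArrayList<>()).add("SplitAtTableScanProcessor");
    // A second rule key with a single processor.
    opRules.computeIfAbsent("RS%", k -> new ArrayList<>()).add("ReduceSinkProcessor");
    return opRules;
  }

  public static void main(String[] args) {
    System.out.println(buildRules());
  }
}
```

With a plain {{Map<K, V>}}, the second {{put}} for the TS key would replace the first processor; the list-valued map keeps both, which is the behavior a Multimap provides.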

SparkRuleDispatcher#dispatch
{code}

  @Override
  public Object dispatch(Node nd, Stack<Node> ndStack, Object... nodeOutputs)
      throws SemanticException {

    // find the firing rule
    // find the rule from the stack specified
    Rule rule = null;
    int minCost = Integer.MAX_VALUE;
    for (Rule r : procRules.keySet()) {
      int cost = r.cost(ndStack);
      if ((cost >= 0) && (cost < minCost)) {
        minCost = cost;
        // I am afraid only 1 rule will be applied here,
        // even when there are two rules for TS
        rule = r;
      }
    }

    Collection<NodeProcessor> procSet;

    if (rule == null) {
      procSet = defaultProcSet;
    } else {
      procSet = procRules.get(rule);
    }

    // Do nothing in case proc is null
    Object ret = null;
    for (NodeProcessor proc : procSet) {
      if (proc != null) {
        // Call the process function
        ret = proc.process(nd, ndStack, procCtx, nodeOutputs);
      }
    }
    return ret;
  }
{code}

I can change the code above as follows, but I don't know which rule's result to return if
more than 1 rule matches TS.
{code}
  @Override
  public Object dispatch(Node nd, Stack<Node> ndStack, Object... nodeOutputs)
      throws SemanticException {

    // find the firing rule
    // find the rule from the stack specified
    // Note: this collects every rule that lowers the running minimum cost, in iteration order
    List<Rule> ruleList = new ArrayList<>();
    int minCost = Integer.MAX_VALUE;
    for (Rule r : procRules.keySet()) {
      int cost = r.cost(ndStack);
      if ((cost >= 0) && (cost < minCost)) {
        minCost = cost;
        ruleList.add(r);
      }
    }

    // Start from the default processors so procSet is always assigned
    Collection<NodeProcessor> procSet = defaultProcSet;

    for (Rule r : ruleList) {
      // Question: I don't know which rule to use if there is more than 1 rule in the ruleList.
      // As written, the last rule in the list overwrites the earlier ones.
      procSet = procRules.get(r);
    }

    // Do nothing in case proc is null
    Object ret = null;
    for (NodeProcessor proc : procSet) {
      if (proc != null) {
        // Call the process function
        ret = proc.process(nd, ndStack, procCtx, nodeOutputs);
      }
    }
    return ret;

  }

{code}
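
For discussion only, here is one possible shape for a dispatcher that runs the processors of every rule tied at the minimum cost and returns all of their results as a list. This is a sketch under assumptions, not a proposed answer to the question above: {{SimpleRule}}, the fixed costs, and the {{Function}}-based processors are hypothetical simplifications and are not Hive's {{Rule}} or {{NodeProcessor}} interfaces.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class TiedRuleDispatchSketch {
  // Hypothetical simplified rule: a name plus a fixed cost (negative means "no match").
  static final class SimpleRule {
    final String name;
    final int cost;
    SimpleRule(String name, int cost) { this.name = name; this.cost = cost; }
  }

  // Runs the processors of every rule tied at the minimum non-negative cost
  // and collects all of their results, in rule iteration order.
  static List<String> dispatch(Map<SimpleRule, List<Function<String, String>>> procRules,
                               String node) {
    int minCost = Integer.MAX_VALUE;
    List<SimpleRule> best = new ArrayList<>();
    for (SimpleRule r : procRules.keySet()) {
      if (r.cost < 0) {
        continue;               // rule does not match
      }
      if (r.cost < minCost) {   // strictly cheaper: restart the tie list
        minCost = r.cost;
        best.clear();
      }
      if (r.cost == minCost) {  // cheapest so far, or tied with it
        best.add(r);
      }
    }
    List<String> results = new ArrayList<>();
    for (SimpleRule r : best) {
      for (Function<String, String> proc : procRules.get(r)) {
        results.add(proc.apply(node));
      }
    }
    return results;
  }

  // Two rules tied at cost 1 and one more expensive rule at cost 5.
  static List<String> demo() {
    Function<String, String> analyze = n -> "analyze:" + n;
    Function<String, String> split = n -> "split:" + n;
    Function<String, String> other = n -> "other:" + n;
    Map<SimpleRule, List<Function<String, String>>> procRules = new LinkedHashMap<>();
    procRules.put(new SimpleRule("analyze", 1), Collections.singletonList(analyze));
    procRules.put(new SimpleRule("split", 1), Collections.singletonList(split));
    procRules.put(new SimpleRule("other", 5), Collections.singletonList(other));
    return dispatch(procRules, "TS");
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Returning a list sidesteps the choice of a single result, but it changes the return contract, so callers that expect one {{Object}} from {{dispatch}} would need to be adapted.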
[~lirui], [~xuefuz], could you share your suggestions on this problem?


> Enable SharedWorkOptimizer in tez on HOS
> ----------------------------------------
>
>                 Key: HIVE-17486
>                 URL: https://issues.apache.org/jira/browse/HIVE-17486
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>         Attachments: HIVE-17486.1.patch, explain.28.share.false, explain.28.share.true, scanshare.after.svg, scanshare.before.svg
>
>
> HIVE-16602 implemented shared scans with Tez.
> Given a query plan, the goal is to identify scans on input tables that can be merged
> so that the data is read only once. The optimization is carried out at the physical level.
> Hive on Spark caches the result of a Spark work if that work is used by more than 1
> child Spark work. After SharedWorkOptimizer is enabled in the physical plan in HoS, identical
> table scans are merged into 1 table scan, whose result is used by more than 1 child
> Spark work. Thanks to the cache mechanism, the same computation need not be repeated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
