hive-issues mailing list archives

From "Hive QA (JIRA)" <>
Subject [jira] [Commented] (HIVE-20623) Shared work: Extend sharing of map-join cache entries in LLAP
Date Sun, 23 Sep 2018 15:34:00 GMT


Hive QA commented on HIVE-20623:

Here are the results of testing the latest attachment:

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 14993 tests executed
*Failed tests:*
org.apache.hadoop.hive.ql.metadata.TestHiveRemote.testPartition (batchId=296)
org.apache.hive.jdbc.TestJdbcWithMiniLlapArrow.testKillQuery (batchId=251)

Test results:
Console output:
Test logs:

Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed

This message is automatically generated.

ATTACHMENT ID: 12940945 - PreCommit-HIVE-Build

> Shared work: Extend sharing of map-join cache entries in LLAP
> -------------------------------------------------------------
>                 Key: HIVE-20623
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: llap, Logical Optimizer
>            Reporter: Gopal V
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Major
>         Attachments: HIVE-20623.patch, hash-shared-work.json.txt, hash-shared-work.svg
> For a query like this
> {code}
> with all_sales as (
> select ss_customer_sk as customer_sk, ss_ext_list_price-ss_ext_discount_amt as ext_price from store_sales
> union all
> select ws_bill_customer_sk as customer_sk, ws_ext_list_price-ws_ext_discount_amt as ext_price from web_sales
> union all
> select cs_bill_customer_sk as customer_sk, cs_ext_sales_price - cs_ext_discount_amt as ext_price from catalog_sales)
> select sum(ext_price) total_price, c_customer_id from all_sales, customer 
> where customer_sk = c_customer_sk
> group by c_customer_id
> order by total_price desc 
> limit 100;
> {code}
> The hashtables used for all 3 joins are identical, but they are loaded 3 times in the same LLAP instance because each is named
> {code}
>     cacheKey = "HASH_MAP_" + this.getOperatorId() + "_container";
> {code}
> in the cache.
> If those are identical in nature (i.e. same vectorization, hashtable type, etc.), then the duplication is just wasted CPU, memory and network. Using the same cache name for hashtables that will be identical in layout would be extremely useful.
> In cases where the join is pushed through a UNION, those are identical.
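As a rough illustration of the idea above, the cache name could be derived from the properties that make two hashtables interchangeable instead of from the operator ID. This is only a sketch: the class `SharedHashTableKey`, the method `cacheKey`, and all four parameters are hypothetical and are not taken from the Hive code base or from the attached patch.

```java
/** Hypothetical sketch: a cache name built from layout properties, not from the operator ID. */
final class SharedHashTableKey {
    private SharedHashTableKey() {}

    /**
     * All parameters are illustrative assumptions. They stand in for whatever
     * properties make two hashtables identical (source task, hashtable type,
     * vectorization, key layout).
     */
    static String cacheKey(String sourceVertex, String hashTableType,
                           boolean vectorized, String keyLayout) {
        return "HASH_MAP_" + sourceVertex
                + "_" + hashTableType
                + "_" + (vectorized ? "VEC" : "ROW")
                + "_" + keyLayout
                + "_container";
    }

    public static void main(String[] args) {
        // Three joins fed by the same upstream task with the same layout
        // resolve to one cache name, so the hashtable would be loaded once.
        String storeSales = cacheKey("Map 1", "HASHMAP", true, "bigint");
        String webSales = cacheKey("Map 1", "HASHMAP", true, "bigint");
        String catalogSales = cacheKey("Map 1", "HASHMAP", true, "bigint");
        System.out.println(storeSales.equals(webSales) && webSales.equals(catalogSales));
    }
}
```

With a name like this, joins whose hashtables differ in source task or layout still get distinct cache entries, while identical ones collapse to a single entry.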
> This optimization can only be done without concern for accidental delays when the same upstream task is generating all of these hashtables, which is what the shared scan optimizer already achieves.
> If the shared work is not present, this has potential downsides: if two customer broadcasts were sourced from "Map 1" and "Map 2", the Map 1 builder will block the other task from reading from Map 2, even though Map 2 might have started later but finished ahead of Map 1.
> So this specific optimization can always be considered for cases where the shared work unifies the operator tree and the parents of all the RS entries involved are the same (and the RS layout is the same).
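The eligibility condition in the description (same parent for all RS entries, same RS layout) can be sketched as a simple check. This is a minimal sketch under stated assumptions: `SharedCacheEligibility`, `RsEntry`, and `canShare` are hypothetical names, and representing a parent and a layout as plain strings is a simplification, not how the Hive optimizer models them.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the "same parent, same RS layout" condition. */
final class SharedCacheEligibility {
    private SharedCacheEligibility() {}

    /** Illustrative stand-in for an RS entry feeding a map-join. */
    static final class RsEntry {
        final String parentId; // assumed identifier of the upstream operator
        final String layout;   // assumed description of the RS key/value layout

        RsEntry(String parentId, String layout) {
            this.parentId = parentId;
            this.layout = layout;
        }
    }

    /** Returns true only if every entry has the same parent and the same layout. */
    static boolean canShare(List<RsEntry> entries) {
        if (entries.isEmpty()) {
            return false;
        }
        RsEntry first = entries.get(0);
        for (RsEntry e : entries) {
            if (!e.parentId.equals(first.parentId) || !e.layout.equals(first.layout)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<RsEntry> unified = Arrays.asList(
                new RsEntry("Map 1", "key:bigint"),
                new RsEntry("Map 1", "key:bigint"),
                new RsEntry("Map 1", "key:bigint"));
        List<RsEntry> split = Arrays.asList(
                new RsEntry("Map 1", "key:bigint"),
                new RsEntry("Map 2", "key:bigint"));
        System.out.println(canShare(unified)); // all entries share one parent and layout
        System.out.println(canShare(split));   // parents differ, the "Map 1" / "Map 2" case
    }
}
```

The second list corresponds to the "Map 1" / "Map 2" case described above, where sharing one cache entry could make one task wait on the other's builder.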

This message was sent by Atlassian JIRA
