hive-issues mailing list archives

From Ádám Szita (Jira) <j...@apache.org>
Subject [jira] [Updated] (HIVE-23947) Cache affinity is unset for text files read by LLAP
Date Tue, 04 Aug 2020 12:44:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-23947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ádám Szita updated HIVE-23947:
------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

> Cache affinity is unset for text files read by LLAP
> ---------------------------------------------------
>
>                 Key: HIVE-23947
>                 URL: https://issues.apache.org/jira/browse/HIVE-23947
>             Project: Hive
>          Issue Type: Bug
>          Components: llap
>            Reporter: Ádám Szita
>            Assignee: Ádám Szita
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> LLAP relies on HostAffinitySplitLocationProvider to always route the same splits to the
> same LLAP daemons. Distributing the data consistently among the nodes in this way yields
> a good cache hit ratio and thus good performance.
> For text files this is almost never guaranteed: HostAffinitySplitLocationProvider is never
> used, because HS2 does not set the cache affinity flag in the job conf for text input format
> content during compilation. The launched Tez AM then has to rely on HDFS location information
> to route the splits (and therefore tasks) to the executor nodes. This location information
> might not overlap well with where the daemons actually run, and in the S3 case the Tez AM
> will choose executors more or less at random.
> As a result, the hit ratio rarely reaches 100%: each time we re-run the same query, some
> disk/S3 reads still occur. Performance stays poor until the same content has been populated
> into all the daemons, which can take tens or hundreds of runs of the query.
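
The effect of consistent routing described above can be illustrated with a small sketch. This is not the HostAffinitySplitLocationProvider implementation; it only assumes that a split can be identified by its file path and start offset, and it uses a plain hash-modulo mapping for brevity. The daemon names, paths, and the function pick_daemon are hypothetical.

```python
import hashlib
from typing import Sequence, Tuple


def pick_daemon(path: str, start_offset: int, daemons: Sequence[str]) -> str:
    """Deterministically map a split to one daemon (illustrative only)."""
    # Use a stable digest; Python's built-in hash() is salted per process
    # for strings, so it would not give the same answer across runs.
    key = f"{path}:{start_offset}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(daemons)
    return daemons[index]


# Hypothetical daemons and splits, made up for this example.
daemons = ["llap-daemon-0", "llap-daemon-1", "llap-daemon-2"]
splits: Sequence[Tuple[str, int]] = [
    ("s3a://bucket/table/part-0.txt", 0),
    ("s3a://bucket/table/part-0.txt", 134217728),
    ("s3a://bucket/table/part-1.txt", 0),
]

# Two "query runs": every split is routed to the same daemon both times,
# so data cached by the first run can be reused by the second.
first_run = [pick_daemon(p, off, daemons) for p, off in splits]
second_run = [pick_daemon(p, off, daemons) for p, off in splits]
```

With a mapping like this, a split is cached on one daemon only and every later run reads it from that cache. With random placement, a split has to land on each daemon at least once before all of them hold it, which matches the "tens or hundreds of runs" behaviour described in the issue. A plain modulo reshuffles most splits when the set of daemons changes, so a production implementation would likely prefer a consistent-hashing scheme.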



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
