hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-20380) LLAP cache should cache small buffers more efficiently
Date Fri, 24 Aug 2018 18:13:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-20380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HIVE-20380:
------------------------------------
    Summary: LLAP cache should cache small buffers more efficiently  (was: explore storing multiple CBs in a single cache buffer in LLAP cache)

> LLAP cache should cache small buffers more efficiently
> ------------------------------------------------------
>
>                 Key: HIVE-20380
>                 URL: https://issues.apache.org/jira/browse/HIVE-20380
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>
> Lately, ORC CBs have become ridiculously small. First there was the 4KB minimum (instead of 256KB); then, after we moved the metadata cache off-heap, the index streams, which are all tiny, began taking up a lot of CBs and wasting space.
> Wasted space can require a larger cache and lead to cache OOMs on some workloads.
> Reducing min.alloc solves this problem, but then there's a lot of heap (and probably compute) overhead to track all these buffers. Arguably, even the 4KB min.alloc is too small.
> We should store contiguous CBs in the same buffer; to start, we can do it for ROW_INDEX streams. That probably means reading all ROW_INDEX streams instead of doing projection when we see that they are too small.
> We need to investigate what the pattern is for ORC data blocks. One option is to increase min.alloc and then consolidate multiple 4-8KB CBs, but only for the same stream. However, a larger min.alloc will result in wastage for really small streams, so we can also consolidate multiple streams (potentially across columns) if needed. This will result in some priority anomalies, but they are probably OK.
> Another consideration is making tracking less object-oriented, in particular passing around integer indexes instead of objects and storing state in giant arrays somewhere (potentially with some optimizations for less common things), instead of every buffer getting its own object.
> cc [~gopalv] [~prasanth_j]
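
Two ideas from the description above, packing several small contiguous CBs into one shared buffer and tracking them by integer index in flat arrays rather than one object per buffer, can be sketched roughly as follows. This is only an illustration under stated assumptions: PackedChunkSlab and all of its methods are hypothetical names, not part of Hive or the LLAP cache, and eviction, locking, and priority handling are left out entirely.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch; none of these names exist in Hive's LLAP cache. */
final class PackedChunkSlab {
    private final ByteBuffer slab;  // one off-heap allocation shared by many small chunks
    private final int[] offsets;    // offsets[i] = start of chunk i within the slab
    private final int[] lengths;    // lengths[i] = size of chunk i in bytes
    private int count = 0;          // number of chunks stored so far
    private int used = 0;           // bytes consumed in the slab

    PackedChunkSlab(int slabBytes, int maxChunks) {
        this.slab = ByteBuffer.allocateDirect(slabBytes);
        this.offsets = new int[maxChunks];
        this.lengths = new int[maxChunks];
    }

    /** Appends a chunk and returns its integer handle, or -1 if it does not fit. */
    int put(byte[] data) {
        if (count == offsets.length || used + data.length > slab.capacity()) {
            return -1;
        }
        ByteBuffer dst = slab.duplicate(); // independent position, shared content
        dst.position(used);
        dst.put(data);
        offsets[count] = used;
        lengths[count] = data.length;
        used += data.length;
        return count++;
    }

    /** Returns a read-only, zero-copy view of the chunk identified by handle. */
    ByteBuffer get(int handle) {
        if (handle < 0 || handle >= count) {
            throw new IndexOutOfBoundsException("bad handle " + handle);
        }
        ByteBuffer view = slab.duplicate();
        view.position(offsets[handle]);
        view.limit(offsets[handle] + lengths[handle]);
        return view.slice().asReadOnlyBuffer();
    }

    int usedBytes() { return used; }

    /** Bytes of the slab not yet occupied by any chunk. */
    int wastedBytes() { return slab.capacity() - used; }
}
```

In this sketch a caller holds only an int handle per chunk, so the per-chunk heap cost is two array slots instead of a tracking object. Storing a 3-byte chunk and a 2-byte chunk in one 4096-byte slab uses 5 bytes of a single allocation, whereas giving each chunk its own 4096-byte minimum allocation would reserve 8192 bytes for the same data.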



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
