hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-20380) explore storing multiple CBs in a single cache buffer in LLAP cache
Date Fri, 24 Aug 2018 01:02:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-20380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591009#comment-16591009
] 

Sergey Shelukhin edited comment on HIVE-20380 at 8/24/18 1:01 AM:
------------------------------------------------------------------

After trying various approaches, I think that since this will involve memory copying and
interleaving buffers anyway, what needs to happen instead is to decrease the allocation size
after decompression.
That is much simpler than having a separate cache, consolidating CBs into a single buffer,
doing partial cache matches, adding offsets to LlapDataBuffer-s, etc.
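
For illustration only, a minimal sketch of that idea; the allocator and buffer types below are
hypothetical stand-ins, not the actual LLAP classes:

{code:java}
// Hypothetical stand-ins for the LLAP allocator and cache buffer; names are invented.
interface ShrinkableAllocator {
  CacheBuffer allocate(int size);                // allocate at least `size` bytes (e.g. the 128Kb ORC CB size)
  void shrinkTo(CacheBuffer buf, int usedBytes); // return the unused tail to the arena, rounded to min.alloc
}

final class CacheBuffer {
  byte[] data;         // backing memory (off-heap in the real cache)
  int allocatedBytes;  // e.g. 128 * 1024
  int usedBytes;       // e.g. 4 * 1024 after decompression
}

final class DecompressThenShrink {
  // After decompressing a CB we know its real size, so the allocation can be
  // trimmed before the buffer is published to the cache.
  static CacheBuffer decompress(ShrinkableAllocator alloc, byte[] compressed, int maxCbSize) {
    CacheBuffer buf = alloc.allocate(maxCbSize);           // worst-case size up front
    buf.usedBytes = decompressInto(compressed, buf.data);  // actual decompressed size
    alloc.shrinkTo(buf, buf.usedBytes);                    // e.g. a 128Kb alloc shrinks toward 4Kb
    return buf;
  }

  private static int decompressInto(byte[] compressed, byte[] out) {
    // placeholder for the codec call (ZLIB, Snappy, etc.)
    int n = Math.min(compressed.length, out.length);
    System.arraycopy(compressed, 0, out, 0, n);
    return n;
  }
}
{code}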

One issue is that, for the small-cache, wide-table case, where the entire cache can become
locked, it's not helpful to replace a fully locked cache of 128Kb buffers holding 4Kb of data
each with 4Kb buffers sitting in the cache every 128Kb; you still cannot get a contiguous
128Kb. So, we'd have to move data. We will not have multiple CBs per Java buffer object, but
merely change allocations so that small CBs don't use large cache buffers.
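
To make the wastage concrete, a back-of-the-envelope calculation (the cache size here is made
up purely for illustration):

{code:java}
// Illustrative arithmetic for the fully locked small-cache, wide-table scenario.
public class LockedCacheWaste {
  public static void main(String[] args) {
    long cacheBytes = 1L << 30;     // 1Gb cache, illustrative
    int allocBytes = 128 * 1024;    // 128Kb buffers
    int liveBytesEach = 4 * 1024;   // 4Kb of live data in each

    long buffers = cacheBytes / allocBytes;    // 8192 locked buffers
    long liveBytes = buffers * liveBytesEach;  // 32Mb of actual payload
    System.out.printf("%d locked 128Kb buffers carry only %dMb of data in a %dMb cache;%n",
        buffers, liveBytes >> 20, cacheBytes >> 20);
    System.out.println("shrinking them in place still leaves no free 128Kb region, so data has to move.");
  }
}
{code}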

If we do this shrinking before putting data into the cache, then unlike regular cache
defragmentation, which is complex, we have a set of already-locked buffers that are also
invisible to anyone else. So we can trivially consolidate within all the buffers allocated by
a read, which no one else can touch in any way, and free up some large buffers completely as
well as parts of the smaller ones (e.g. if we have 10 ROW_INDEX streams, each with <4Kb of
data but sitting in 128Kb allocations because the ORC file CB size is 128Kb, we can create 10
4Kb buffers within one of those 10, deallocate the 9 remaining 128Kb buffers outright, and
also free the 64Kb + 16Kb + 8Kb left over in the first one).
We can also do an extra step (e.g. if we have a single 128Kb allocation holding 4Kb of data)
of allocating a small buffer explicitly (without defragmentation, and with a flag not to split
buffers larger than the original for this - there's no point in creating a 4Kb buffer out of
another 128Kb of empty space in this example), and copying the data there before deallocating
the big one. That will pick up all the crumbs created by other consolidations like the one
above. Without splitting and retries the allocation can be cheap and safe.
This will be controlled by a waste threshold setting.
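
A rough sketch of that consolidation pass, for illustration only; the types and the packing
policy below are simplified assumptions, not the real LLAP buffer classes. The waste threshold
decides which buffers are worth repacking:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical consolidation over the buffers produced by one read. All of them are
// locked and not yet visible to any other thread, so copying between them is safe.
final class ReadBufferConsolidator {
  static final class Buf {
    final byte[] data;
    final int used;       // live bytes, e.g. 4Kb
    final int allocated;  // allocation size, e.g. 128Kb
    Buf(int used, int allocated) { this.data = new byte[allocated]; this.used = used; this.allocated = allocated; }
  }

  private final double wasteThreshold;  // e.g. 0.75: repack buffers that are more than 75% empty

  ReadBufferConsolidator(double wasteThreshold) { this.wasteThreshold = wasteThreshold; }

  /** Packs small CBs into one of their own allocations; returns the allocations that become free. */
  List<Buf> consolidate(List<Buf> readBuffers) {
    List<Buf> freeable = new ArrayList<>();
    Buf target = null;  // allocation currently being filled
    int offset = 0;     // next free byte in target
    for (Buf b : readBuffers) {
      double waste = 1.0 - (double) b.used / b.allocated;
      if (waste < wasteThreshold) continue;             // big enough, leave it in place
      if (target == null || offset + b.used > target.allocated) {
        target = b;                                     // start packing into this allocation
        offset = b.used;
        continue;
      }
      System.arraycopy(b.data, 0, target.data, offset, b.used);
      offset += b.used;
      freeable.add(b);                                  // its whole 128Kb allocation can be released
    }
    // The unused tail of `target` (e.g. the 64Kb + 16Kb + 8Kb above) would be released by shrinking.
    return freeable;
  }
}
{code}

With the 10 ROW_INDEX streams above, this packs the ten 4Kb CBs into one of their 128Kb
allocations and returns the other nine for immediate deallocation.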

Unfortunately, this will do slightly less than nothing for Hive 2 without the defrag patch.
However, if we backport the defrag patch (pending), this will also work for Hive 2.

I may not be able to work on this to completion immediately, so I'm just posting a brain dump
here for reference. cc [~gopalv]


> explore storing multiple CBs in a single cache buffer in LLAP cache
> -------------------------------------------------------------------
>
>                 Key: HIVE-20380
>                 URL: https://issues.apache.org/jira/browse/HIVE-20380
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>
> Lately ORC CBs are becoming ridiculously small. First there's the 4Kb minimum (instead
of 256Kb); then, after we moved the metadata cache off-heap, the index streams, which are all
tiny, take up a lot of CBs and waste space.
> Wasted space can require a larger cache and lead to cache OOMs on some workloads.
> Reducing min.alloc solves this problem, but then there's a lot of heap (and probably
compute) overhead to track all these buffers. Arguably even the 4Kb min.alloc is too small.
> We should store contiguous CBs in the same buffer; to start, we can do it for ROW_INDEX
streams. That probably means reading all ROW_INDEX streams instead of doing projection when
we see that they are too small.
> We need to investigate what the pattern is for ORC data blocks. One option is to increase
min.alloc and then consolidate multiple 4-8Kb CBs, but only for the same stream. However, a
larger min.alloc will result in wastage for really small streams, so we can also consolidate
multiple streams (potentially across columns) if needed. This will result in some priority
anomalies, but they are probably OK.
> Another consideration is making tracking less object oriented, in particular passing
around integer indexes instead of objects and storing state in giant arrays somewhere
(potentially with some optimizations for less common things), instead of every buffer getting
its own object (see the sketch after this description).
> cc [~gopalv] [~prasanth_j]
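
For the last point, a sketch of what index-based tracking could look like; the class and field
names are invented here for illustration and are not existing Hive code:

{code:java}
// Invented sketch of "less object oriented" buffer tracking: state lives in parallel
// arrays and callers pass around int indexes instead of one object per cache buffer.
final class BufferTable {
  private final long[] offsets;   // arena offset of each tracked buffer
  private final int[] sizes;      // allocation size; 0 marks a free slot
  private final int[] refCounts;  // lock state

  BufferTable(int capacity) {
    offsets = new long[capacity];
    sizes = new int[capacity];
    refCounts = new int[capacity];
  }

  /** Registers a buffer and returns its index; callers hold the int, not an object. */
  int register(long offset, int size) {
    for (int i = 0; i < sizes.length; i++) {  // a real version would use a free list
      if (sizes[i] == 0) {
        offsets[i] = offset; sizes[i] = size; refCounts[i] = 0;
        return i;
      }
    }
    throw new IllegalStateException("buffer table full");
  }

  void lock(int idx) { refCounts[idx]++; }
  void unlock(int idx) { refCounts[idx]--; }
  long offsetOf(int idx) { return offsets[idx]; }
  int sizeOf(int idx) { return sizes[idx]; }
}
{code}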



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
