hive-issues mailing list archives

From "Misha Dmitriev (JIRA)" <>
Subject [jira] [Commented] (HIVE-15882) HS2 generating high memory pressure with many partitions and concurrent queries
Date Thu, 23 Feb 2017 20:58:44 GMT


Misha Dmitriev commented on HIVE-15882:

I've measured how much memory my change saves. It turned out to be too difficult/time-consuming
to determine the "threshold" number of concurrent requests that my benchmark can sustain with
the same small heap (500M), so I switched to a different approach. I set the heap size to a
high value (3G), sufficient for the benchmark to pass without any GC issues both with and
without my changes. Then I ran the benchmark first without and then with my changes, measuring
the live set of the heap (that is, the size of live objects immediately after a full GC) every
4 seconds, using the following script:

while true ; do
  # Force a full GC (jmap -histo:live collects garbage before taking the histogram)
  sudo -u hive jmap -histo:live $PID > /dev/null
  # Sum the used survivor, eden and old-gen space reported by jstat (in KB)
  sudo -u hive jstat -gc $PID | tail -n 1 | awk '{split($0,a," "); sum=a[3]+a[4]+a[6]+a[8]; print sum}'
  sleep 4
done
Then I took the highest number printed by this script, i.e. the biggest live heap size
observed while running my benchmark. I ended up with:

1173M - without my changes
743M - with my changes

That means that my changes (String interning plus interning Properties objects in PartitionDesc,
which will be posted in a separate patch) collectively save about 37% of memory
((1173 - 743) / 1173 ≈ 0.37).
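For reference, the effect of String.intern() can be seen with a minimal self-contained demo
(this is an illustration of the mechanism, not the actual HIVE-15882 patch; the partition
path string is made up):

```java
// Minimal demo of why String.intern() reduces duplicate-string overhead.
// Not the actual HIVE-15882 patch; the partition path is a made-up example.
public class InternDemo {
    public static void main(String[] args) {
        // Two equal strings built at runtime occupy two separate heap objects...
        String a = new String("hdfs://nn/warehouse/misha_table/p=1");
        String b = new String("hdfs://nn/warehouse/misha_table/p=1");
        System.out.println(a == b);                   // false: two copies in the heap
        // ...while intern() maps both to one canonical instance, so equal
        // per-partition metadata strings are stored only once.
        System.out.println(a.intern() == b.intern()); // true: one shared copy
    }
}
```

With ~10 such call sites on the hottest metadata strings, every query that touches the same
2000 partitions references the same canonical string objects instead of its own copies.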

> HS2 generating high memory pressure with many partitions and concurrent queries
> -------------------------------------------------------------------------------
>                 Key: HIVE-15882
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: HIVE-15882.01.patch, hs2-crash-2000p-500m-50q.txt
> I've created a Hive table with 2000 partitions, each backed by two files, with one row
> in each file. When I execute a number of concurrent queries against this table, e.g.
> {code}
> for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:10000 -n admin -p admin -e
> "select count(i_f_1) from misha_table;" & done
> {code}
> it results in a big memory spike. With 20 queries I caused an OOM in an HS2 server with
> -Xmx200m, and with 50 queries in one with -Xmx500m.
> I am attaching the results of jxray analysis of a heap dump that was generated in the
> 50 queries / 500m heap scenario. It suggests that there are several opportunities to
> reduce memory pressure with not very invasive changes to the code:
> 1. 24.5% of memory is wasted by duplicate strings (see section 6). With String.intern()
> calls added in the ~10 relevant places in the code, this overhead can be greatly reduced.
> 2. Almost 20% of memory is wasted by various suboptimally used collections (see section
> 8). There are many maps and lists that are either empty or have just 1 element. By
> modifying the code that creates and populates these collections, we may save 5-10% of memory.
> 3. Almost 20% of memory is used by instances of java.util.Properties. These objects appear
> to be highly duplicated, since for each Partition each concurrently running query creates
> its own copy of Partition, PartitionDesc and Properties. Thus we have nearly 100,000
> (50 queries * 2,000 partitions) Properties objects in memory. By interning/deduplicating
> these objects we may be able to save perhaps 15% of memory.
> So overall, I think there is a good chance to reduce HS2 memory consumption in this
> scenario by ~40%.
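The Properties deduplication proposed in point 3 above could take roughly the following shape.
This is a hedged sketch: PropertiesInterner is a hypothetical helper, not Hive's actual
implementation. It relies on java.util.Properties inheriting content-based equals/hashCode
from Hashtable, and interned instances must not be mutated afterwards, or the cache keys
become inconsistent:

```java
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of deduplicating identical Properties objects across concurrent queries.
// PropertiesInterner is a hypothetical helper, not Hive's actual implementation.
public class PropertiesInterner {
    // Canonical instances, keyed by content (Properties extends Hashtable,
    // whose equals/hashCode compare the full key/value contents).
    private static final ConcurrentMap<Properties, Properties> CACHE =
            new ConcurrentHashMap<>();

    // Returns one canonical instance for any Properties with equal content.
    // Callers must treat the returned object as read-only.
    public static Properties intern(Properties p) {
        Properties canonical = CACHE.putIfAbsent(p, p);
        return canonical != null ? canonical : p;
    }

    public static void main(String[] args) {
        Properties p1 = new Properties();
        p1.setProperty("serialization.format", "1");
        Properties p2 = new Properties();
        p2.setProperty("serialization.format", "1");
        System.out.println(p1 == p2);                 // false: two separate copies
        System.out.println(intern(p1) == intern(p2)); // true: one canonical copy
    }
}
```

With such a cache, the ~100,000 Properties (50 queries * 2,000 partitions) would collapse
to roughly 2,000 canonical instances, one per distinct partition.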

This message was sent by Atlassian JIRA
