hive-issues mailing list archives

From "Prasanth Jayachandran (JIRA)" <>
Subject [jira] [Commented] (HIVE-19206) Automatic memory management for open streaming writers
Date Fri, 27 Apr 2018 09:47:00 GMT


Prasanth Jayachandran commented on HIVE-19206:

[~gopalv] Please discard the RB request. It includes HIVE-19211 plus this patch, so it
will look big.

> Automatic memory management for open streaming writers
> ------------------------------------------------------
>                 Key: HIVE-19206
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Streaming
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>            Priority: Major
>         Attachments: HIVE-19206.1.patch
> Problem:
>  When hundreds of record updaters are open, the memory required by the ORC writers
> keeps growing because of ORC's internal buffers. This can lead to high GC overhead or OOM
> errors during streaming ingest.
> Solution:
>  The high-level idea is for the streaming connection to remember all the open record
> updaters and flush them periodically (at some interval). The number of records written to
> each record updater can be used as a metric to determine the candidate record updaters for
> flushing.
>  If the stripe size of the ORC file is 64MB, the default memory management check happens only
> after every 5000 rows, which may be too late when there are too many concurrent writers
> in a process. For example, with 100 writers open and each holding an almost full stripe
> of 64MB of buffered data, this would take 100*64MB ~= 6GB of memory. When all of the record writers
> flush, the memory usage drops to 100*~2MB, which is just ~200MB.
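
A minimal sketch of the idea in the quoted description, not the code in the attached patch: a registry remembers every open writer, counts rows written since the last flush, and flushes the writers above a row threshold. All names here (WriterFlushSketch, TrackedWriter, WriterRegistry, flushCandidates, bufferedMb) are hypothetical and are not Hive or ORC APIs. The periodic timer is left out, and flush() only resets a counter instead of writing a stripe.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only; none of these types exist in Hive or ORC. */
public class WriterFlushSketch {

    /** Stand-in for an open record updater; tracks rows buffered since the last flush. */
    static final class TrackedWriter {
        final String name;
        long rowsSinceFlush;

        TrackedWriter(String name) {
            this.name = name;
        }

        void write(long rows) {
            rowsSinceFlush += rows;
        }

        /** A real writer would write out its buffered stripe here; the sketch only resets the counter. */
        void flush() {
            rowsSinceFlush = 0;
        }
    }

    /** Remembers every open writer so that flush candidates can be chosen across all of them. */
    static final class WriterRegistry {
        private final List<TrackedWriter> writers = new ArrayList<>();

        TrackedWriter open(String name) {
            TrackedWriter w = new TrackedWriter(name);
            writers.add(w);
            return w;
        }

        /** Flushes writers with at least minRows buffered rows, largest first, and returns their names. */
        List<String> flushCandidates(long minRows) {
            List<TrackedWriter> candidates = new ArrayList<>();
            for (TrackedWriter w : writers) {
                if (w.rowsSinceFlush >= minRows) {
                    candidates.add(w);
                }
            }
            candidates.sort(
                Comparator.comparingLong((TrackedWriter w) -> w.rowsSinceFlush).reversed());
            List<String> flushed = new ArrayList<>();
            for (TrackedWriter w : candidates) {
                flushed.add(w.name);
                w.flush();
            }
            return flushed;
        }
    }

    /** Buffered memory in MB for a number of writers that each hold mbPerWriter. */
    static long bufferedMb(int writerCount, long mbPerWriter) {
        return writerCount * mbPerWriter;
    }

    public static void main(String[] args) {
        WriterRegistry registry = new WriterRegistry();
        registry.open("partition-a").write(4_000);
        registry.open("partition-b").write(9_000);
        registry.open("partition-c").write(10);

        // Only the two busy writers cross the threshold; the idle one is left alone.
        System.out.println("flushed: " + registry.flushCandidates(1_000));

        // The arithmetic from the description: 100 writers at 64MB versus ~2MB each.
        System.out.println("before flush: " + bufferedMb(100, 64) + " MB");
        System.out.println("after flush:  " + bufferedMb(100, 2) + " MB");
    }
}
```

In a real connection the flushCandidates call would presumably run on a timer at the configured interval. The row threshold is only one possible way to use the "records written" metric from the description.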

This message was sent by Atlassian JIRA