tez-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rohini Palaniswamy (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TEZ-3159) Reduce memory utilization while serializing keys and values
Date Wed, 09 Mar 2016 20:02:40 GMT
Rohini Palaniswamy created TEZ-3159:
---------------------------------------

             Summary: Reduce memory utilization while serializing keys and values
                 Key: TEZ-3159
                 URL: https://issues.apache.org/jira/browse/TEZ-3159
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Rohini Palaniswamy


  Currently DataOutputBuffer is used for serializing. The underlying buffer keeps doubling
in size when it reaches capacity. In some of the Pig scripts which serialize big bags, we
end up with OOM for some situations where mapreduce runs fine. 

    - When combiner runs in reducer and some of the fields after combining are still big bags
(For eg: distinct). Currently with mapreduce combiner does not run in reducer - MAPREDUCE-5221.
Since input sort buffers hold good amount of memory at that time it can easily go OOM
   -  While serializing output when there are multiple inputs and outputs and the sort buffers
for those take up space.

It is a pain especially after buffer size hits 128MB. Doubling at 128MB will require 128MB
(existing array) +256MB (new array). Any doubling after that requires even more space. But
most of the time the data is probably not going to fill up that 256MB leading to wastage.

 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message