hadoop-mapreduce-dev mailing list archives

From "David Mollitor (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-7194) New Method For CombineFile
Date Mon, 18 Mar 2019 17:18:00 GMT
David Mollitor created MAPREDUCE-7194:
-----------------------------------------

             Summary: New Method For CombineFile
                 Key: MAPREDUCE-7194
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7194
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: mrv2
    Affects Versions: 3.2.0
            Reporter: David Mollitor
            Assignee: David Mollitor


The {{CombineFileInputFormat}} class is responsible for grouping blocks together to form larger
splits.  The current implementation is very naive.  It iterates over the list of available
blocks and, as long as the current group of blocks is smaller than the maximum split size, it
keeps adding blocks.  The check for whether a split has reached its maximum size happens *after*
each block is added.  For example, given a maximum split size "M" and two blocks that are each
7/8 M, the blocks will be grouped together to create a split of 14/8 M.  If M is a large number,
this split will be very large and not what the operator would expect.
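
To make the 7/8 M example concrete, here is a minimal sketch of the check-after-add behavior described above. It is a simplified model, not the Hadoop source: the class and method names are hypothetical, blocks are reduced to plain sizes, and node/rack locality is ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the grouping described above; NOT the Hadoop source. */
class CheckAfterAddSketch {

    /**
     * Groups block sizes into splits, testing the size limit only after a
     * block has already been added to the current split.
     * (Hypothetical helper; sizes are in arbitrary units.)
     */
    static List<Long> group(long[] blockSizes, long maxSplitSize) {
        List<Long> splitSizes = new ArrayList<>();
        long current = 0;
        for (long size : blockSizes) {
            current += size;                 // add the block first...
            if (current >= maxSplitSize) {   // ...then test the limit
                splitSizes.add(current);
                current = 0;
            }
        }
        if (current > 0) {
            splitSizes.add(current);         // leftover blocks form a final split
        }
        return splitSizes;
    }
}
```

With M modeled as 8 units, calling CheckAfterAddSketch.group(new long[]{7, 7}, 8) yields a single split of 14 units, i.e. 14/8 M.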

I propose a general cleanup and also enforcing that, unless a file cannot be split, its
splits will not be larger than the configured maximum size.  This will give operators
a much more straightforward way of calculating the expected number of splits.
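
One possible reading of this proposal, sketched under assumptions: check the limit *before* adding a block, and let a block that is itself larger than the maximum (standing in for an unsplittable file) form its own split. This is not the actual patch; the class and method names are hypothetical, and locality is again ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative check-before-add grouping; an assumption, NOT the actual patch. */
class CheckBeforeAddSketch {

    /**
     * Groups block sizes into splits so that no split exceeds maxSplitSize,
     * except when a single block is already larger than the limit.
     * (Hypothetical helper; sizes are in arbitrary units.)
     */
    static List<Long> group(long[] blockSizes, long maxSplitSize) {
        List<Long> splitSizes = new ArrayList<>();
        long current = 0;
        for (long size : blockSizes) {
            // Close the current split first if this block would push it over the limit.
            if (current > 0 && current + size > maxSplitSize) {
                splitSizes.add(current);
                current = 0;
            }
            current += size;
        }
        if (current > 0) {
            splitSizes.add(current);         // leftover blocks form a final split
        }
        return splitSizes;
    }
}
```

With M modeled as 8 units, two blocks of 7 units now produce two splits of 7 units each, while a lone block of 10 units still ends up in one oversized split because it cannot be divided in this model.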



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


