tez-dev mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TEZ-3215) Support for MultipleOutputs
Date Sat, 16 Apr 2016 02:46:25 GMT
Ming Ma created TEZ-3215:
----------------------------

             Summary: Support for MultipleOutputs
                 Key: TEZ-3215
                 URL: https://issues.apache.org/jira/browse/TEZ-3215
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Ming Ma


Here is the use case: a reducer might write its output to more than one file, with each file
name based on the mapper key. We don't know all possible keys ahead of time. In MR, MultipleOutputs
provides this support, but I couldn't find anything readily available in Tez.

Options I looked at:

* Setting up one DataSink per file ahead of time won't work, as we don't know all possible keys.
* Using MR MultipleOutputs directly from the Tez application processor: it isn't clear how to
pass a TaskInputOutputContext to MultipleOutputs.
* Tez MROutput can create a DataSink based on the specified outputFormat, but it can't take
MR MultipleOutputs.

I ended up modifying Tez MROutput to keep a HashMap {{recordWriters}} to achieve this. If this is
already a solved problem, can anyone explain how to do it?
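
For illustration, the idea of a map of record writers keyed by output file can be sketched as
below. This is a minimal sketch in plain Java, not the actual MROutput change: the class
KeyedFileWriters, its methods, and the tab-separated line format are hypothetical, and it writes
local files through java.nio instead of going through a Hadoop OutputFormat or RecordWriter.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not a Tez or MapReduce class): one lazily created writer per file name. */
class KeyedFileWriters implements AutoCloseable {
    private final Path outputDir;
    // Mirrors the "recordWriters" map idea: file name -> open writer.
    private final Map<String, BufferedWriter> recordWriters = new HashMap<>();

    KeyedFileWriters(Path outputDir) {
        this.outputDir = outputDir;
    }

    /** Writes one key/value line to the named file, opening the writer on first use. */
    void write(String fileName, String key, String value) throws IOException {
        BufferedWriter writer = recordWriters.get(fileName);
        if (writer == null) {
            // The set of file names is not known up front, so writers are created on demand.
            writer = Files.newBufferedWriter(outputDir.resolve(fileName), StandardCharsets.UTF_8);
            recordWriters.put(fileName, writer);
        }
        writer.write(key + "\t" + value + "\n");
    }

    /** Number of writers opened so far. */
    int openWriterCount() {
        return recordWriters.size();
    }

    /** Closes every writer; rethrows the first failure after attempting all of them. */
    @Override
    public void close() throws IOException {
        IOException first = null;
        for (BufferedWriter writer : recordWriters.values()) {
            try {
                writer.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                }
            }
        }
        recordWriters.clear();
        if (first != null) {
            throw first;
        }
    }
}
```

A caller would derive the file name from each record's key, call write(fileName, key, value) for
every record, and close the helper once at the end of the task. A real output would also need to
handle task commit and abort, which this sketch leaves out.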




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
