tez-dev mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: Partition input
Date Fri, 31 Jul 2015 19:44:32 GMT
Hi,

There is a way around this: because the data doesn't move over a Tez edge,
there's no reason to actually use an edge partitioner.

https://github.com/t3rmin4t0r/tpcds-partitioner/blob/master/src/main/java/org/notmysock/tpcds/ParTable.java#L198


But that's the terminal Vertex case.
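
A minimal sketch of the idea above, in plain Java with no Tez classes: a
terminal task computes the partition for each record itself and writes
straight to one file per partition, so no edge partitioner is involved. This
is not the code from the linked ParTable.java. The class name
DirectPartitionWriter, the "partition=<n>/task-<i>.txt" layout, and the
modulo-hash rule are all assumptions made for illustration, and a local
directory stands in for HDFS.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for a terminal task that partitions its own output.
 * Each task owns one file per partition it touches, so tasks never collide.
 */
public class DirectPartitionWriter implements AutoCloseable {
    private final Path baseDir;
    private final int taskIndex;
    private final int numPartitions;
    // Writers are opened lazily, one per partition this task actually sees.
    private final Map<Integer, BufferedWriter> writers = new HashMap<>();

    public DirectPartitionWriter(Path baseDir, int taskIndex, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.baseDir = baseDir;
        this.taskIndex = taskIndex;
        this.numPartitions = numPartitions;
    }

    /** Assumed partitioning rule: non-negative hash of the key, modulo partition count. */
    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Routes one record to this task's file for the record's partition. */
    public void write(String key, String value) throws IOException {
        int partition = partitionFor(key, numPartitions);
        BufferedWriter writer = writers.get(partition);
        if (writer == null) {
            Path dir = baseDir.resolve("partition=" + partition);
            Files.createDirectories(dir);
            writer = Files.newBufferedWriter(
                    dir.resolve("task-" + taskIndex + ".txt"), StandardCharsets.UTF_8);
            writers.put(partition, writer);
        }
        writer.write(key + "\t" + value);
        writer.newLine();
    }

    /** Flushes and closes every partition file opened by this task. */
    @Override
    public void close() throws IOException {
        for (BufferedWriter writer : writers.values()) {
            writer.close();
        }
        writers.clear();
    }
}
```

Each task would create one writer with its own task index, call write() for
every record it reads, and close it at the end. The result is one file per
(partition, task) pair, which is why nothing has to be shuffled.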

If you want to send data via HDFS between vertices, you still need an HCFS
Edge, which sends data via a filesystem & URI locations via events.

Cheers,
Gopal

On 7/31/15, 9:42 AM, "Siddharth Seth" <sseth@apache.org> wrote:

>At the moment, using either the OrderedPartitionedKVOutput or
>UnorderedPartitionedKVOutput along with MROutput (assuming you want the data
>on HDFS) is the best way to do this.
>There's no variant of MROutput which supports partitioning. If something
>like this were to be added - it would need to figure out how to generate
>the partitioned files correctly - since each task and output file would end
>up with multiple partitions.
>
>On Mon, Jul 27, 2015 at 9:47 AM, Oleg Zhurakousky <
>ozhurakousky@hortonworks.com> wrote:
>
>> Guys
>>
>> I have a simple DAG where I simply want to partition the input data. In
>> theory this should not require more than a single Vertex (read splits and
>> write them to individual partitions). In other words, a Vertex with a
>> Datasource and DataSink.
>> However, it appears that unless I have a vertex sending its output to an
>> OrderedPartitionedKVOutput, the partitioner is not called and the output
>> goes to a single partition.
>>
>> Any pointers?
>> Cheers
>> Oleg
>>
>>


