tez-dev mailing list archives

From Bikas Saha <bi...@hortonworks.com>
Subject RE: How to decide number of partitions in a Map Output
Date Fri, 24 Jan 2014 05:09:05 GMT
This feature is turned off by default in Tez. So unless you have turned it
on (or Hive turns it on), the number of reducers is probably different
because the latest Hive compiler tries to determine the number of reducers
from table statistics at compile time. It may be that this Hive compiler
feature is on by default for Hive-on-Tez compilation but off by default
for Hive-on-MR compilation.

Bikas

-----Original Message-----
From: Hitesh Shah [mailto:hitesh@apache.org]
Sent: Thursday, January 23, 2014 8:27 PM
To: dev@tez.incubator.apache.org
Subject: Re: How to decide number of partitions in a Map Output

Yes. It looks at the size of outputs generated by the tasks in the
previous vertex and determines the no. of tasks to run based on the
configured amount of input data size per task.
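
The sizing rule described above can be sketched as follows. This is a minimal illustration, not the actual ShuffleVertexManager code: the class name, method name and parameters are hypothetical, and clamping the result to the originally configured task count is an assumption made for the example.

```java
// Hypothetical sketch; names and the clamping rule are assumptions,
// not taken from the Tez source.
final class ParallelismSketch {
    private ParallelismSketch() {}

    /**
     * Picks a task count so that each task reads roughly
     * desiredBytesPerTask of upstream output. The result is kept
     * between 1 and configuredParallelism.
     */
    static int chooseParallelism(long[] upstreamOutputBytes,
                                 long desiredBytesPerTask,
                                 int configuredParallelism) {
        if (desiredBytesPerTask <= 0 || configuredParallelism <= 0) {
            throw new IllegalArgumentException(
                "desiredBytesPerTask and configuredParallelism must be positive");
        }
        // Total size of the outputs produced by the previous vertex.
        long total = 0;
        for (long bytes : upstreamOutputBytes) {
            total += bytes;
        }
        // Ceiling division: a partial chunk still needs its own task.
        long wanted = (total + desiredBytesPerTask - 1) / desiredBytesPerTask;
        long clamped = Math.max(1L, Math.min(wanted, (long) configuredParallelism));
        return (int) clamped;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] outputs = {300 * mb, 200 * mb, 100 * mb};
        // 600 MB of upstream output at 256 MB per task, at most 10 tasks.
        System.out.println(chooseParallelism(outputs, 256 * mb, 10));
    }
}
```

With 600 MB of upstream output and 256 MB per task, the sketch asks for three tasks; a very large upstream output is capped at the configured count.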

-- Hitesh

On Jan 23, 2014, at 7:59 PM, Manu Zhang wrote:

> So that's what *ShuffleVertexManager.determineParallelismAndApply* does?
>
>
> On Fri, Jan 24, 2014 at 11:35 AM, Gopal Vijayaraghavan <gopalv@apache.org> wrote:
>
>> On Thu, Jan 23, 2014 at 7:25 PM, Manu Zhang <owenzhang1990@gmail.com>
>> wrote:
>>
>>> Another question: when running Hive over Tez, why is the number of
>>> reducers not the same as with Hive over MR, given the same input
>>> data and configuration?
>>
>> This is too big a note to write out into an email
>>
>> http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/
>>
>> But in short, unlike MR, Tez enables you to set reducer count dynamically.
>>
>> Cheers,
>> Gopal
>>
>>> On Tue, Jan 21, 2014 at 10:30 AM, Bikas Saha <bikas@hortonworks.com> wrote:
>>>
>>>> It would help us understand your situation if you could give a
>>>> short description of how your changes are going to speed up Map
>>>> output. If it is generally useful then we could consider adding it
>>>> to the existing library of inputs and outputs.
>>>>
>>>> The number of physical outputs == the number of downstream
>>>> consumers of that partitioned data. Think of them as the number of
>>>> reducers. So if there are N reducers (and thus you will be
>>>> partitioning the data N ways) then the number of physical outputs == N.
>>>>
>>>> The MRPartitioner config item (tez.runtime.num.expected.partitions)
>>>> is used to communicate the above information to the MRPartitioner.
>>>> In the above example you would set it to N so that the
>>>> MRPartitioner would partition the data N ways.
>>>>
>>>> Hive uses internal statistics to calculate the expected number of
>>>> partitions at compile time. However once Hive determines the number
>>>> of tasks (say reducers) then the partitioner will always get that
>>>> value for the number of partitions to create.
>>>>
>>>> Not sure what you mean by partitionId will exceed that number when
>>>> running some jobs. Can you please elaborate? Do you mean that
>>>> partitioner.getPartition(Key, Value, Partition) is getting a value
>>>> for Partition > num physical outputs? In that case, please check
>>>> your customized output code because that should be the one calling
>>>> the getPartition() method.
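
The relationship described above (N consumers, N physical outputs, and a partitioner told to split N ways) can be sketched as follows. This is an illustration under stated assumptions, not the MRPartitioner implementation: the class and both methods are hypothetical, and simple hash partitioning stands in for whatever partitioner a job actually configures.

```java
// Hypothetical sketch; not the MRPartitioner or any Tez Output class.
final class PartitionSketch {
    private PartitionSketch() {}

    /** Hash-partitions a key into the range [0, numPartitions). */
    static int getPartition(Object key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // Mask the sign bit so a negative hashCode cannot yield a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /**
     * What a custom output might do before writing a record: compute the
     * partition and fail loudly if it does not map to a physical output.
     * This only happens when the partition count given to the partitioner
     * is larger than the number of physical outputs.
     */
    static int checkedPartition(Object key, int numPartitions, int numPhysicalOutputs) {
        int partition = getPartition(key, numPartitions);
        if (partition >= numPhysicalOutputs) {
            throw new IllegalStateException(
                "partition " + partition + " has no physical output (only "
                    + numPhysicalOutputs + " exist)");
        }
        return partition;
    }

    public static void main(String[] args) {
        // Partition count and physical output count agree, so every key fits.
        System.out.println(checkedPartition("some-key", 4, 4));
    }
}
```

When both counts are the same value N, every partition ID falls in 0..N-1. A partition ID beyond the physical outputs points to the two counts having been set from different sources.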
>>>>
>>>> Bikas
>>>>
>>>> -----Original Message-----
>>>> From: Manu Zhang [mailto:owenzhang1990@gmail.com]
>>>> Sent: Monday, January 20, 2014 6:10 PM
>>>> To: dev@tez.incubator.apache.org
>>>> Subject: How to decide number of partitions in a Map Output
>>>>
>>>> Hi all,
>>>>
>>>> I've been working on a customized Output which works like
>>>> OnFileSortedOutput but with optimizations that will speed up Map
>>>> output.
>>>>
>>>> The issue is about the *number of partitions*. My current
>>>> implementation sets it to the number of physicalOutputs but the
>>>> *partitionId will exceed that number* when running some jobs.
>>>>
>>>> After referring to MRPartitioner, I found the number of partitions
>>>> is set to "tez.runtime.num.expected.partitions" (or 1 if null). So
>>>> what is the difference between that property and physicalOutputs?
>>>>
>>>> Also, when running Hive queries over Tez (with my customized
>>>> output), a Hive property "hive.exec.reducers.bytes.per.reducer"
>>>> could also alter the number of partitions, according to my
>>>> observation.
>>>>
>>>> Any ideas?
>>>> Thanks
>>>>
>>>> Manu Zhang
>>>>
>>>> --
>>>> CONFIDENTIALITY NOTICE
>>>> NOTICE: This message is intended for the use of the individual or
>>>> entity to which it is addressed and may contain information that is
>>>> confidential, privileged and exempt from disclosure under
>>>> applicable law. If the reader of this message is not the intended
>>>> recipient, you are hereby notified that any printing, copying,
>>>> dissemination, distribution, disclosure or forwarding of this
>>>> communication is strictly prohibited. If you have received this
>>>> communication in error, please contact the sender immediately and
>>>> delete it from your system. Thank You.
>>>>
>>

