tez-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Manu Zhang <owenzhang1...@gmail.com>
Subject Re: How to decide number of partitions in a Map Output
Date Fri, 24 Jan 2014 03:59:04 GMT
So that's what *ShuffleVertexManager.determineParallelismAndApply* does ?


On Fri, Jan 24, 2014 at 11:35 AM, Gopal Vijayaraghavan <gopalv@apache.org>wrote:

> On Thu, Jan 23, 2014 at 7:25 PM, Manu Zhang <owenzhang1990@gmail.com>
> wrote:
>
> > Another question is that, when running Hive over Tez, why is the number
> of
> > reducers not the same as that of Hive over MR, provided with the same
> input
> > data and configurations ?
>
> This is too big a note to write out into an email
>
> http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/
>
> But in short, unlike MR, Tez enables you to set reducer count dynamically.
>
> Cheers,
> Gopal
>
> > On Tue, Jan 21, 2014 at 10:30 AM, Bikas Saha <bikas@hortonworks.com>
> wrote:
> >
> >> It would help us understand your situation if you could give a short
> >> description of how you changes are going to speed up Map output. If it
> is
> >> generally useful then we could consider adding it the existing library
> of
> >> inputs and outputs.
> >>
> >> The number of physical outputs == the number of downstream consumers of
> >> that partitioned data. Think of them as the number of reducers. So if
> >> there are N reducers (and thus you will be partitioning the data N ways)
> >> then number of physical output == N.
> >>
> >> The MRPartitioner config item (tez.runtime.num.expected.partitions) is
> >> used to communicate the above information to the MRPartitioner. In the
> >> above example you would set it to N so that the MRPartitioner would
> >> partition the data N ways.
> >>
> >> Hive uses internal statistics to calculate the expected number of
> >> partitions at compile time. However once Hive determines the number of
> >> tasks (say reducers) then the partitioner will always get that value for
> >> the number of partitions to create.
> >>
> >> Not sure what you mean by partitionId will exceed that number when
> running
> >> some jobs. Can you please elaborate? Do you mean that the
> >> partitioner.getPartition(Key, Value, Partition) is getting a value for
> >> Partition > num physical outputs? In that case, please check your
> >> customized output code because that should be the one calling the
> >> getPartition() method.
> >>
> >> Bikas
> >>
> >> -----Original Message-----
> >> From: Manu Zhang [mailto:owenzhang1990@gmail.com]
> >> Sent: Monday, January 20, 2014 6:10 PM
> >> To: dev@tez.incubator.apache.org
> >> Subject: How to decide number of partitions in a Map Output
> >>
> >> Hi all,
> >>
> >> I've been working on a customized Output which works like
> >> OnFileSortedOutput but with optimizations that will speed up Map output.
> >>
> >> The issue is about the *number of partitions*. My current implementation
> >> is set it to number of physicalOutputs but the *partitionId will exceed
> >> that
> >> number* when runnning some jobs.
> >>
> >> After referring to  MRPartitioner, I found the number of partition is
> set
> >> to "tez.runtime.num.expected.partitions" (or 1 if null) . So what is the
> >> difference between that property and physicalOutputs ?
> >>
> >> Also , when running Hive queries over Tez (with my customized output), a
> >> Hive property "hive.exec.reducers.bytes.per.reducer" could also alter
> the
> >> number of partitions, according to my observation.
> >>
> >> Any ideas ?
> >> Thanks
> >>
> >> Manu Zhang
> >>
> >> --
> >> CONFIDENTIALITY NOTICE
> >> NOTICE: This message is intended for the use of the individual or
> entity to
> >> which it is addressed and may contain information that is confidential,
> >> privileged and exempt from disclosure under applicable law. If the
> reader
> >> of this message is not the intended recipient, you are hereby notified
> that
> >> any printing, copying, dissemination, distribution, disclosure or
> >> forwarding of this communication is strictly prohibited. If you have
> >> received this communication in error, please contact the sender
> immediately
> >> and delete it from your system. Thank You.
> >>
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message