tez-dev mailing list archives

From Bikas Saha <bi...@hortonworks.com>
Subject RE: Automatic Reducer Parallelism
Date Sun, 16 Mar 2014 22:58:53 GMT
The VertexManager plugin is set on the vertex via the DAG API. Since it's a
logical user concept, it must be set by the user. Currently we internally
set the ShuffleVertexManager plugin whenever no plugin is set and there
is a scatter-gather edge. This default is going to change, and it will then
be necessary to set the plugin explicitly.

So to turn off this behavior (say when doing range partitioning), the
ShuffleVertexManager should not be set on that vertex (probably set a
different manager instead).

There is a payload associated with each vertex manager when specified on
the DAG API. So each ShuffleVertexManager can be configured differently
using its own payload.
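
A rough sketch of the per-vertex decision described above, in plain Java
with no Tez dependency. Everything here is hypothetical and for illustration
only: the class, the ManagerChoice holder, the "SomeOtherVertexManager"
placeholder and the payload keys are assumptions, not the Tez DAG API. Only
the ShuffleVertexManager name comes from this thread.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class VertexManagerChoiceSketch {

    /** Hypothetical holder: the manager to set on one vertex plus its own payload settings. */
    static final class ManagerChoice {
        final String managerName;
        final Map<String, String> payload;

        ManagerChoice(String managerName, Map<String, String> payload) {
            this.managerName = managerName;
            this.payload = Collections.unmodifiableMap(payload);
        }
    }

    /**
     * Picks a manager for one vertex. Auto parallelism is skipped when the user
     * fixed the parallelism (e.g. Pig's PARALLEL clause) or when a custom
     * partitioner such as range partitioning is involved.
     */
    static ManagerChoice choose(boolean userSetParallel, boolean customPartitioner,
                                long desiredTaskInputBytes, int minTaskParallelism) {
        Map<String, String> payload = new LinkedHashMap<>();
        if (userSetParallel || customPartitioner) {
            // Placeholder name: the thread only says "a different manager".
            return new ManagerChoice("SomeOtherVertexManager", payload);
        }
        // Hypothetical payload keys; each vertex gets its own values.
        payload.put("ENABLE_AUTO_PARALLEL", "true");
        payload.put("DESIRED_TASK_INPUT_SIZE", Long.toString(desiredTaskInputBytes));
        payload.put("MIN_TASK_PARALLELISM", Integer.toString(minTaskParallelism));
        return new ManagerChoice("ShuffleVertexManager", payload);
    }

    public static void main(String[] args) {
        ManagerChoice groupBy = choose(false, false, 100L * 1024 * 1024, 2);
        ManagerChoice orderBy = choose(false, true, 100L * 1024 * 1024, 2);
        System.out.println(groupBy.managerName + " " + groupBy.payload);
        System.out.println(orderBy.managerName + " " + orderBy.payload);
    }
}
```

In a real DAG the chosen manager and payload would be passed through the
vertex manager plugin setting on the DAG API; the exact calls are left out
here on purpose.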

Not much is planned other than improvements to the heuristic as and when
something shows up.

Hive does its own parallelism calculation using its stats during
compilation. It should be enabling ARP but AFAIK it has not yet done so.

Bikas

-----Original Message-----
From: Rohini Palaniswamy [mailto:rohini.aditya@gmail.com]
Sent: Sunday, March 16, 2014 12:10 PM
To: dev@tez.incubator.apache.org
Subject: Automatic Reducer Parallelism

Hi,
   I was looking at configuring ARP for Pig on Tez. My understanding of
what is available currently is:

  ShuffleVertexManager is the one that currently supports auto parallelism.
If TEZ_AM_SHUFFLE_VERTEX_MANAGER_ENABLE_AUTO_PARALLEL is set to true, then
based on TEZ_AM_SHUFFLE_VERTEX_MANAGER_DESIRED_TASK_INPUT_SIZE and
TEZ_AM_SHUFFLE_VERTEX_MANAGER_MIN_TASK_PARALLELISM, parallelism is
computed based on stats from some of the completed map tasks after the
slow start threshold for reducers kicks in and reducer tasks are started.
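
To make that understanding concrete, here is a minimal sketch of such a
calculation in plain Java. It is not the actual ShuffleVertexManager code.
It assumes the output seen from completed source tasks is extrapolated
linearly to all source tasks, that the result is divided by the desired
task input size, and that parallelism is only ever reduced from the
configured value and never drops below the minimum. All names are
hypothetical.

```java
final class AutoParallelismSketch {

    /**
     * Estimates reducer parallelism from partial map-side statistics.
     * Returns the configured parallelism unchanged when there is nothing to go on.
     */
    static int estimateParallelism(long observedOutputBytes, int completedSources,
                                   int totalSources, long desiredTaskInputBytes,
                                   int minTaskParallelism, int configuredParallelism) {
        if (completedSources <= 0 || totalSources <= 0 || desiredTaskInputBytes <= 0) {
            return configuredParallelism;
        }
        // Assumption: remaining source tasks produce output at the same average rate.
        double expectedTotalBytes =
                (double) observedOutputBytes * totalSources / completedSources;
        long needed = (long) Math.ceil(expectedTotalBytes / desiredTaskInputBytes);
        // Never go below the configured minimum (and at least one task).
        long floor = Math.max(1, minTaskParallelism);
        long clamped = Math.max(floor, needed);
        // Assumption: the estimate can only lower the configured parallelism.
        return (int) Math.min(clamped, configuredParallelism);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 25 of 100 maps done, 250 MB seen so far, 100 MB desired per reducer.
        int tasks = estimateParallelism(250 * mb, 25, 100, 100 * mb, 5, 50);
        System.out.println("estimated reducer tasks: " + tasks);
    }
}
```

With the numbers in main, 250 MB from a quarter of the maps extrapolates to
1000 MB in total, so the sketch settles on 10 reducers instead of the
configured 50.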


Questions:
    1) Since it is an AM-level setting, it looks like it is possible to say do
not apply auto parallelism for this vertex. Is that correct?  Pig has a
PARALLEL clause which allows users to set parallelism for a particular
operation like JOIN, GROUP BY or ORDER BY. We would like to honor that and
use automatic parallelism only for operations where user has not defined
PARALLEL.  Also when there is a custom partitioner involved (like range
partitioning in case of order by) we do not want ARP to kick in. Is it
possible to turn on or off ARP per vertex?
    2) How is ARP used in Hive?
    3) Any other things we need to know about ARP? Any new optimizations
or changes planned?

Regards,
Rohini
