tez-dev mailing list archives

From Robert Grandl <rgra...@yahoo.com.INVALID>
Subject Re: Data manipulation in Hive over Tez
Date Fri, 02 Dec 2016 01:40:36 GMT

 Thanks Rajesh for your answer. That was really helpful. 

I would like to ask you a few more questions. I am trying to better understand how the <Key,
Value> pairs are propagated and processed at the various vertices.

My current understanding is:
- Edge: encodes the data movement logic.
- Processing logic: processes the input and partitions the output key space according to its
  logic. The processing logic in every stage also follows a sequence of operators through
  which every <Key, Value> pair is passed.
My questions are:

1) I am a bit confused about how far the processing logic in a stage can go (especially in
Reduce tasks). Given an input of <Key, Value> pairs, what are the typical patterns of
processing logic, i.e. what kinds of <Key, Value> pairs can it produce, and how much can
the vertex change them?
The question is a bit confusing, but basically I am trying to understand which
{input <Key, Value>, output <Key, Value>} patterns can be handled in general
by typical processing logic for SQL queries written in Hive atop Tez.

2) I can't really wrap my head around how much connection exists between the data movement
encoded in edges and how the <Key, Value> pairs are generated by a vertex and moved to the
corresponding downstream vertices.

Thanks again for your answers,
Robert


 On Tuesday, November 29, 2016 4:04 AM, Rajesh Balamohan <rajesh.balamohan@gmail.com>

 Hi Robert,

1. At a high level, you can refer to https://github.com/apache/
exec/tez/DagUtils.java, where the different vertices, edges, etc. get created
as per the execution plan.
Consider a vertex as a combination of input, processing logic and output.
Different vertices are connected together by edges, which define the
data movement logic (broadcast, scatter-gather, one-to-one, etc.).
The edge configuration defines the key and value classes. This DAG is
submitted to Tez for execution.
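
To make the three data movement types above more concrete, here is a minimal
sketch of which downstream tasks would receive a record under each of them.
It is a plain-Java illustration, not Tez or Hive code: the class
EdgeRoutingSketch, the enum Movement and the method destinations are all
hypothetical names, and hash partitioning for scatter-gather is an assumption
(the actual partitioner is whatever the edge is configured with).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: none of these names are Tez or Hive classes.
public class EdgeRoutingSketch {

    // Stand-ins for the three data movement types mentioned above.
    enum Movement { BROADCAST, SCATTER_GATHER, ONE_TO_ONE }

    /**
     * Returns the indexes of the downstream tasks that receive a record with
     * the given key, emitted by upstream task sourceTask, when the downstream
     * vertex runs numDestTasks tasks.
     */
    static List<Integer> destinations(Movement movement, Object key,
                                      int sourceTask, int numDestTasks) {
        List<Integer> targets = new ArrayList<>();
        switch (movement) {
            case BROADCAST:
                // Every downstream task sees every record.
                for (int i = 0; i < numDestTasks; i++) {
                    targets.add(i);
                }
                break;
            case SCATTER_GATHER:
                // Assumption: hash partitioning on the key; floorMod keeps the
                // index non-negative even for negative hash codes.
                targets.add(Math.floorMod(key.hashCode(), numDestTasks));
                break;
            case ONE_TO_ONE:
                // Task i of the upstream vertex feeds task i downstream.
                targets.add(sourceTask);
                break;
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(destinations(Movement.BROADCAST, "k", 0, 3));
        System.out.println(destinations(Movement.SCATTER_GATHER, 7, 0, 3));
        System.out.println(destinations(Movement.ONE_TO_ONE, "k", 2, 3));
    }
}
```

The point of the sketch is that the key only influences routing in the
scatter-gather case; for broadcast and one-to-one the destination is fixed by
the edge type alone.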

2. For task processing, you can refer to https://github.com/apache/
exec/tez/TezProcessor.java on the Hive side.

3. On the Tez side, there are different types of inputs and outputs
available for reading/writing data, e.g. OrderedGroupedKVInput,
UnorderedKVInput, OrderedPartitionedKVOutput, UnorderedKVOutput and
UnorderedPartitionedKVOutput.

For instance, an ordered output writes the data in sorted order. There
are different types of sorters available in Tez, which can be chosen at
runtime (DefaultSorter, PipelinedSorter). Intermediate data of tasks is
written in "IFile" format, which is similar to the IFile format in the MR
world but has more optimizations in the Tez implementation.
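
As a rough picture of what "ordered" and "partitioned" mean for an output,
the sketch below assigns each key/value pair to a partition and then sorts
every partition by key. This is only a conceptual, in-memory model: the class
OrderedPartitionedSketch and the method partitionAndSort are hypothetical
names, hash partitioning is an assumption, and nothing here reflects how
DefaultSorter, PipelinedSorter or the IFile format are actually implemented.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: a conceptual model, not Tez code.
public class OrderedPartitionedSketch {

    /**
     * Splits the records into numPartitions buckets (one per downstream task)
     * and sorts each bucket by key. The sort is stable, so records with equal
     * keys keep their arrival order.
     */
    static List<List<Map.Entry<String, Integer>>> partitionAndSort(
            List<Map.Entry<String, Integer>> records, int numPartitions) {
        List<List<Map.Entry<String, Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> record : records) {
            // Assumption: hash partitioning on the key.
            int p = Math.floorMod(record.getKey().hashCode(), numPartitions);
            partitions.get(p).add(record);
        }
        for (List<Map.Entry<String, Integer>> partition : partitions) {
            partition.sort(Map.Entry.comparingByKey());
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = new ArrayList<>();
        records.add(new AbstractMap.SimpleEntry<>("b", 1));
        records.add(new AbstractMap.SimpleEntry<>("a", 2));
        records.add(new AbstractMap.SimpleEntry<>("c", 3));
        records.add(new AbstractMap.SimpleEntry<>("a", 4));
        // Each printed line is one partition, already sorted by key.
        for (List<Map.Entry<String, Integer>> partition : partitionAndSort(records, 2)) {
            System.out.println(partition);
        }
    }
}
```

In this model a downstream task that reads one partition from several
upstream tasks only has to merge already-sorted runs, which is the usual
motivation for sorting on the output side.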


As far as reading is concerned, the key/value class and serializer
information is passed on as part of creating the DAG. E.g


On Sat, Nov 26, 2016 at 5:13 AM, Robert Grandl <rgrandl@yahoo.com.invalid>

> Hi guys,
> I am not sure where the right place to post this question is, hence I am
> sending it to both the hive and tez dev mailing lists.
> I am trying to get a better understanding of how the input / output for a
> task is handled.  Typically input stages read the data to be processed.
> Next, all the data will flow in the form of key / value pairs till the end
> of the job's execution.
> 1. Could you point me to the key files where I should look to
> identify that? I am mostly interested in intercepting where data is read by
> a task and where the data is written after the task processes the input data.
> 2. Also, is there a way I can identify the types (and hence read the
> actual values) of a key / value pair instead of just Object key, Object
> value?
> Thanks in advance,
> Robert

