tez-dev mailing list archives

From Hitesh Shah <hit...@apache.org>
Subject Re: Tez compatibility with MR
Date Fri, 10 Jan 2014 23:45:33 GMT
Hi Jonathan 

Most of the points below summarize the views of the current Tez devs:

We have tried to get many aspects of MR working on Tez, but not to full completion, as most
of the current contributors moved on to focus on other parts of Tez.
There are some MR features we have not gotten around to, and some we may not even be aware of.
There are also some minor things that may not make sense for Tez to support.

Broad categories/missing features:

i) Job History: The plan is to use YARN Application History/Timeline to create Tez-specific
history. There is no history/UI support at the moment, whether for a running AM or after the job completes.
ii) Recovery: In the works in conjunction with the above history implementation 
iii) Configuration knobs: We have run a set of MR system tests and fixed a number of the compatibility
issues we saw. I am sure that as more folks try MR on Tez, we will discover more gaps.
iv) Task run-time: There are still minor issues which probably need to be addressed. For example,
TEZ-637 to set all the required bits needed by MR components. Progress support is not fully
functional as of now.
v) Speculation: No one has started work on this yet.
vi) Command-line tools:

Taking a look at bin/mapred job:

	[-submit <job-file>]
	[-status <job-id>]
	[-counter <job-id> <group-name> <counter-name>]
	[-kill <job-id>]
	[-set-priority <job-id> <priority>]. Valid values for priorities are: VERY_HIGH
	[-events <job-id> <from-event-#> <#-of-events>]
	[-history <jobHistoryFile>]
	[-list [all]]
	[-list-attempt-ids <job-id> <task-type> <task-state>]. Valid values for <task-type> are REDUCE MAP. Valid values for <task-state> are running, completed
	[-kill-task <task-attempt-id>]
	[-fail-task <task-attempt-id>]
	[-logs <job-id> <task-attempt-id>]

By running MR tasks within the Tez context, quite a lot of information is obviously lost.
This is a big gap currently, partly because we have not looked at it and partly because it is an open design
question what should be supported. For example, today there is no support for a task
to provide any general update information back to the AM (which could then be exposed to
the client). What an overall task state means becomes a tricky question when a task
consists of a single processor, multiple inputs and multiple outputs.
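
To illustrate why the overall task state is ambiguous, here is a rough sketch in Python. It is not Tez code: the TaskComponents class, the two aggregation functions and the progress numbers are all hypothetical, made up only to show that two equally plausible policies give different answers for the same task.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskComponents:
    """Hypothetical snapshot of one task; each progress value is in [0.0, 1.0]."""
    processor: float
    inputs: List[float] = field(default_factory=list)
    outputs: List[float] = field(default_factory=list)

    def all_values(self) -> List[float]:
        # One processor plus every input and output, as a flat list.
        return [self.processor, *self.inputs, *self.outputs]


def progress_mean(task: TaskComponents) -> float:
    """Policy A (assumption): overall progress is the plain average of all components."""
    values = task.all_values()
    return sum(values) / len(values)


def progress_min(task: TaskComponents) -> float:
    """Policy B (assumption): overall progress is that of the slowest component."""
    return min(task.all_values())


# Made-up numbers: one input fully read, one not started, no output written yet.
snapshot = TaskComponents(processor=0.5, inputs=[1.0, 0.0], outputs=[0.0, 0.0])

print("mean policy:", progress_mean(snapshot))
print("min policy: ", progress_min(snapshot))
```

The two policies disagree on the same snapshot, and neither is obviously the right thing to report to a client that expects a single MR-style progress value.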

We will take any help we can get :). If you are specifically looking at MR compatibility,
the above list can get you started. Or you can start by trying to run your existing MR jobs
against Tez and looking at bugs/feature gaps.

-- Hitesh

On Jan 10, 2014, at 11:19 AM, Jonathan Eagles wrote:

> I have seen some comments on missing functionality in Tez such as
> "MapReduce on Tez is not 100% compatible with traditional MapReduce -
> example the functionality available on the JobClient to track individual
> tasks is missing."
> It's not quite clear to me at this point all the missing pieces and whether
> those are design limitations or just not enough hands to get to them all
> due to other more pressing priorities. If the latter, I'd be happy to help
> out to add these or other features if there is need.
> jeagles
