tez-dev mailing list archives

From Shiri Marron <Shiri.Mar...@amdocs.com>
Subject RE: Problem when running our code with tez
Date Sun, 30 Aug 2015 09:20:07 GMT

-----Original Message-----
From: Hersh Shafer 
Sent: Thursday, August 27, 2015 11:45 AM
To: Daniel Dai; dev@tez.apache.org; dev@pig.apache.org; Shiri Marron
Cc: Almog Shunim
Subject: RE: Problem when running our code with tez


-----Original Message-----
From: Daniel Dai [mailto:daijy@hortonworks.com]
Sent: Wednesday, August 26, 2015 1:57 AM
To: dev@tez.apache.org; dev@pig.apache.org
Cc: Hersh Shafer; Almog Shunim
Subject: Re: Problem when running our code with tez

JobID is vague in Tez; you should use dagId instead. However, I don't see a way you can get
the dagId within RecordWriter/OutputCommitter. A possible solution is to use conf.get("mapreduce.workflow.id")
+ conf.get("mapreduce.workflow.node.name"). Note that both are Pig-specific configuration properties and
are only applicable if you run with Pig.
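As an illustration (not from the original thread), a minimal sketch of this workaround; the class name and naming scheme are hypothetical, and it assumes the job was submitted by Pig so that both properties are set:

import org.apache.hadoop.conf.Configuration;

public final class PigWorkflowTempDir {
    private PigWorkflowTempDir() {}

    // Builds a suffix that is unique per Pig script run and per vertex.
    // Returns null when not running under Pig (both properties unset).
    public static String uniqueSuffix(Configuration conf) {
        String workflowId = conf.get("mapreduce.workflow.id");        // one id per Pig script execution
        String nodeName   = conf.get("mapreduce.workflow.node.name"); // one name per vertex in the script
        if (workflowId == null || nodeName == null) {
            return null; // fall back to another naming scheme outside Pig
        }
        return workflowId + "_" + nodeName;
    }
}

Both RecordWriter.close(TaskAttemptContext) and OutputCommitter.commitJob(JobContext) can reach a Configuration via getConfiguration(), so the same suffix is reproducible in both places, which is exactly what the jobID no longer guarantees under Tez.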


On 8/25/15, 2:08 PM, "Hitesh Shah" <hitesh@apache.org> wrote:

>+dev@pig as this might be a question better answered by Pig developers.
>This probably won't answer your question but should give you some
>background info. When Pig uses Tez, it may end up running multiple DAGs
>within the same YARN application, therefore the "jobId" (in the case of
>MR, the job id maps to the application id from YARN) may not be unique.
>Furthermore, there are cases where multiple vertices within the same
>DAG could write to HDFS, hence both dagId and vertexId are required to
>guarantee uniqueness when writing to a common location.
>-- Hitesh
>On Aug 25, 2015, at 7:29 AM, Shiri Marron <Shiri.Marron@amdocs.com> wrote:
>> Hi,
>> We are trying to run our existing workflows, which contain Pig
>>scripts, on Tez (HDP 2.2), but we are facing some problems when we
>>run our code with Tez.
>> In our code, we write to and read from a temp directory which we
>>create with a name based on the jobID:
>>     Part 1 - We extend org.apache.hadoop.mapreduce.RecordWriter, and
>>in close() we take the jobID from the TaskAttemptContext. Each task
>>writes a file to this directory in the close() method according to
>>the jobID from the context.
>>    Part 2 - At the end of the whole job (after all the tasks have
>>completed), we have our custom OutputCommitter (which extends
>>org.apache.hadoop.mapreduce.OutputCommitter), and in commitJob() it
>>looks for that directory of the job and handles all the files under
>>it; the jobID is taken from the JobContext.
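For readers following along, an illustrative reconstruction of the two-part mechanism described above (the class names, base path, and file naming are hypothetical, not the actual Amdocs code):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class JobIdScratchDir {
    // Hypothetical base path for the shared temp directory.
    static Path tempDirFor(String jobId) {
        return new Path("/tmp/myapp/" + jobId);
    }

    // Part 1: each task writes a file under a jobID-derived directory in close().
    public static class MyRecordWriter extends RecordWriter<String, String> {
        @Override
        public void write(String key, String value) { /* buffer records */ }

        @Override
        public void close(TaskAttemptContext context) throws IOException {
            String jobId = context.getTaskAttemptID().getJobID().toString();
            Path dir = tempDirFor(jobId); // under Tez: the jobID carries the appended vertex id
            FileSystem fs = dir.getFileSystem(context.getConfiguration());
            fs.create(new Path(dir, context.getTaskAttemptID().toString())).close();
        }
    }

    // Part 2: commitJob() resolves the same directory from the JobContext.
    public static class MyOutputCommitter extends OutputCommitter {
        @Override
        public void commitJob(JobContext context) throws IOException {
            String jobId = context.getJobID().toString();
            Path dir = tempDirFor(jobId); // under Tez: a different id, so the directory is missed
            FileSystem fs = dir.getFileSystem(context.getConfiguration());
            fs.listStatus(dir); // ... handle all files under dir
        }

        // Remaining abstract methods, stubbed for the sketch:
        @Override public void setupJob(JobContext c) {}
        @Override public void setupTask(TaskAttemptContext c) {}
        @Override public boolean needsTaskCommit(TaskAttemptContext c) { return false; }
        @Override public void commitTask(TaskAttemptContext c) {}
        @Override public void abortTask(TaskAttemptContext c) {}
    }
}

Under plain MapReduce both contexts yield the same jobID, so the two parts agree on the directory; under Tez they do not, which is the failure reported below.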
>> We noticed that when we use Tez this mechanism doesn't work, since
>>the jobID seen in the Tez task (Part 1) is the original id with the
>>vertex id appended, for example 14404914675610 instead of
>>1440491467561, so the directory name in Part 2 is different from the
>>one in Part 1.
>> We looked for a way to retrieve only the vertex id or only the job
>>id, but didn't find one: in the configuration, the property
>>mapreduce.job.id also had the vertex id appended, and no other
>>property value was equal to the original job id.
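One way to carry out such a search (an illustrative snippet, not from the original thread; names are hypothetical) is to iterate over all Configuration entries, since Configuration is Iterable:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public class ConfIdDump {
    // Prints every configuration entry whose key mentions "id" or whose
    // value contains the original job id, to check whether any property
    // still carries the unmodified id.
    public static void dump(Configuration conf, String originalJobId) {
        for (Map.Entry<String, String> e : conf) {
            String v = e.getValue();
            if (e.getKey().toLowerCase().contains("id")
                    || (v != null && v.contains(originalJobId))) {
                System.out.println(e.getKey() + " = " + v);
            }
        }
    }
}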
>> Can you please advise how we can solve this issue? Is there a way
>>to get the original jobID when we're in Part 1?
>> Regards,
>> Shiri Marron
>> Amdocs

