hadoop-mapreduce-dev mailing list archives

From Erez Katz <erez_k...@yahoo.com>
Subject Re: mapred tmp work dir, streaming
Date Wed, 16 Dec 2009 02:35:48 GMT
Alright, some updates:

Creating files in the local file system did not end up as files on HDFS.

On the other hand, the following snippet worked rather well:


import os
import subprocess
import sys

# Write the side file to the task's local scratch directory first.
fn = os.environ['TMPDIR'] + '/TYPE_B.' + os.environ['mapred_task_partition']
sys.stderr.write("File name is " + fn + "\n")

k = open(fn, 'w')
k.write("ha ha ha ")
k.flush()
k.close()

# Then copy it into the task's work output directory on HDFS.
cmd = os.environ['HADOOP_HOME'] + "/bin/hadoop dfs -fs " + \
      os.environ['fs_default_name'] + " -put " + fn + " " + \
      os.environ['mapred_work_output_dir']

sys.stderr.write("CMD " + cmd + "\n")
retcode = subprocess.call(cmd.split(' '))
sys.stderr.write("retcode = " + str(retcode) + "\n")


I suppose that if I wanted to avoid writing to the local hard drive and then copying, I could start
a process with the command line "hadoop dfs -put - TARGET_FILE" and write to that process'
input stream...
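
Something like this is what I have in mind; a rough sketch only, and it assumes our version of 'hadoop dfs -put' accepts '-' as the source to mean "read from standard input" (I haven't verified that):

import os
import subprocess

# Stream the data straight into the task's work output directory on HDFS,
# without touching the local disk first.
target = (os.environ['mapred_work_output_dir'] + '/TYPE_B.' +
          os.environ['mapred_task_partition'])
cmd = [os.environ['HADOOP_HOME'] + '/bin/hadoop', 'dfs',
       '-fs', os.environ['fs_default_name'],
       '-put', '-', target]
p = subprocess.Popen(cmd, stdin=subprocess.PIPE)
p.stdin.write("ha ha ha\n")
p.stdin.close()
retcode = p.wait()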

opinions?

 Erez






--- On Tue, 12/15/09, Erez Katz <erez_katz@yahoo.com> wrote:

> From: Erez Katz <erez_katz@yahoo.com>
> Subject: mapred tmp work dir, streaming
> To: mapreduce-dev@hadoop.apache.org
> Date: Tuesday, December 15, 2009, 5:28 PM
> Hi,
> 
> I have a scenario where a reducer should generate two types
> of outputs, so instead of the standard 
> part-00000
> part-00001
> etc.
> 
> I want 
> typeA-00000
> typeB-00000
> 
> typeA-00001
> typeB-00001
> 
> (I can still keep the part-00000* files around, even if
> they are empty).
> 
> Currently I have some legacy code (could you imagine?
> Hadoop legacy code already...) which uses the HDFS API (in
> C++) to write directly to the folder returned by
> jobConf->get("mapred.work.output.dir"); the file name would
> be "typeA"+"-"+conf->get("mapred.task.partition").
> 
> The code is getting more and more cumbersome to maintain.
> Also, I would rather port that particular application to be
> a streaming application.
> 
> I noticed that in streaming the job conf parameters appear
> as environment variables, with the '.'s replaced by '_'s (I
> ran a streaming app where the mapper was 'env').
> 
> The following env variables caught my eye:
> 
> PWD=/home/hadoop/tmp/mapred/local/taskTracker/jobcache/job_200912141348_0118/attempt_200912141348_0118_m_000000_0/work     
> 
> TMPDIR=/home/hadoop/tmp/mapred/local/taskTracker/jobcache/job_200912141348_0118/attempt_200912141348_0118_m_000000_0/work/tmp   
>     
> 
> mapred_work_output_dir=hdfs://hcluster:8080/user/erkatz/osenviron/_temporary/_attempt_200912141348_0118_m_000000_0   
> 
> 
> job_local_dir=/home/hadoop/tmp/mapred/local/taskTracker/jobcache/job_200912141348_0118/work   
> 
> 
> 
> Now before I go on, building my app around things I dig up
> in this manner never makes me feel overly cozy, so if these
> variables are subject to change without notice, that would
> be a good thing to know.
> 
> I create a local file and then, from within my reducer
> script (Python, in this case), call 'hadoop dfs -put' to put
> it in the directory denoted by 'mapred_work_output_dir'.
> 
> On the other hand, what if I just created that file in the
> tmp directory (either the one pointed to by TMPDIR or just
> ./tmp)? Do you think I can count on Hadoop to copy whatever
> is in that folder into the output folder on HDFS?
> 
> What do you think?
> 
> Thanks,
> 
>   Erez Katz
> 
> 
> 
>       
> 


      
