hadoop-mapreduce-dev mailing list archives

From Daniel Templeton <dan...@cloudera.com>
Subject Re: Skip bad records when streaming supported?
Date Thu, 13 Apr 2017 21:49:27 GMT
You have to modify wordcount-mapper-t1.py to just ignore the bad line.  
In the worst case, you should be able to do something like:

import sys
for line in sys.stdin:
    try:
        pass  # Insert processing code here
    except Exception:
        pass  # Error processing record, ignore it
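
To make that a bit more concrete, here is a rough sketch of what a skip-tolerant word count mapper could look like. I have not seen wordcount-mapper-t1.py, so the record format (an id, a tab, then the text) and the names process and run_mapper are assumptions made up for illustration. The reporter:counter line written to stderr is the Hadoop Streaming convention for updating a counter; the group and counter names used here are arbitrary.

```python
import sys


def process(line):
    """Hypothetical per-record logic: parse 'id<TAB>text' and count words.

    Raises ValueError when the line has no tab, which stands in for
    whatever makes a record "bad" in the real mapper.
    """
    record_id, text = line.rstrip("\n").split("\t", 1)
    return [(word, 1) for word in text.split()]


def run_mapper(lines, out=sys.stdout, err=sys.stderr):
    """Emit tab-separated (word, 1) pairs, skipping records that raise."""
    skipped = 0
    for line in lines:
        try:
            pairs = process(line)
        except Exception:
            skipped += 1
            # Hadoop Streaming picks up counter updates written to stderr
            # in this form; the group and counter names are arbitrary.
            err.write("reporter:counter:SkipTest,SkippedRecords,1\n")
            continue
        for word, count in pairs:
            out.write("%s\t%d\n" % (word, count))
    return skipped
```

Used as the streaming mapper, the script would end with a call to run_mapper(sys.stdin). Catching Exception broadly is deliberate here: the goal is that no single record can fail the task, and the counter keeps the skipped records visible in the job output rather than silently dropped.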


On 4/13/17 1:33 PM, Pillis W wrote:
> Hello,
> I am using 'hadoop-streaming.jar' to do a simple word count, and want to
> skip records that fail execution. Below is the actual command I run, and
> the mapper always fails on one record, and hence fails the job. The input
> file is 3 lines with 1 bad line.
> hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -D mapred.job.name=SkipTest
> -Dmapreduce.task.skip.start.attempts=1 -Dmapreduce.map.skip.maxrecords=1
> -Dmapreduce.reduce.skip.maxgroups=1
> -Dmapreduce.map.skip.proc.count.autoincr=false
> -Dmapreduce.reduce.skip.proc.count.autoincr=false -D mapred.reduce.tasks=1
> -D mapred.map.tasks=1 -files
> /home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py
> -input /user/hadoop/data/test1 -output /user/hadoop/data/output-test-5
> -mapper "python wordcount-mapper-t1.py" -reducer "python
> wordcount-reducer-t1.py"
> I was wondering if skipping of records is supported when MapReduce is used
> in streaming mode?
> Thanks in advance.
> PW

