hadoop-mapreduce-dev mailing list archives

From Pillis W <pillis.w...@gmail.com>
Subject Re: Skip bad records when streaming supported?
Date Thu, 13 Apr 2017 22:24:41 GMT
Thanks Daniel.

Please correct me if I have misunderstood, but according to the
documentation at
http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Skipping_Bad_Records
the sole purpose of this functionality seems to be tolerating unknown
failures/exceptions in mappers/reducers. If I were able to catch all
failures myself, I would not need this feature at all. Is that not true?

If I have understood it incorrectly, when would one use the feature to skip
bad records?
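
For concreteness, here is a minimal sketch of the catch-everything approach
in a word-count streaming mapper. It is an illustration under assumptions,
not the contents of wordcount-mapper-t1.py: the names tokenize and
run_mapper and the counter group/name WordCount,SkippedRecords are
hypothetical. The stderr line uses the Hadoop Streaming
"reporter:counter:<group>,<counter>,<amount>" convention so that skipped
lines remain visible in the job counters.

```python
import sys


def tokenize(line):
    """Return (word, 1) pairs for one input line (hypothetical processing step)."""
    return [(word, 1) for word in line.split()]


def run_mapper(lines, process=tokenize, out=sys.stdout, err=sys.stderr):
    """Emit tab-separated (word, count) pairs, skipping lines that fail.

    Returns the number of skipped lines.
    """
    skipped = 0
    for line in lines:
        try:
            # Process the whole line first so a bad line emits nothing at all.
            pairs = process(line)
        except Exception:
            skipped += 1
            # Hadoop Streaming reads stderr lines of this form as counter updates.
            err.write("reporter:counter:WordCount,SkippedRecords,1\n")
            continue
        for word, count in pairs:
            out.write("%s\t%d\n" % (word, count))
    return skipped
```

Used as the actual mapper script, the file would end with a call to
run_mapper(sys.stdin). Catching Exception rather than using a bare except
lets KeyboardInterrupt and SystemExit still stop the task.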

Regards,
PW




On Thu, Apr 13, 2017 at 2:49 PM, Daniel Templeton <daniel@cloudera.com>
wrote:

> You have to modify wordcount-mapper-t1.py to just ignore the bad line.  In
> the worst case, you should be able to do something like:
>
> import sys
>
> for line in sys.stdin:
>   try:
>     # Insert processing code here
>     pass
>   except Exception:
>     # Error processing record, ignore it
>     pass
>
> Daniel
>
>
> On 4/13/17 1:33 PM, Pillis W wrote:
>
>> Hello,
>> I am using 'hadoop-streaming.jar' to do a simple word count, and want to
>> skip records that fail execution. Below is the actual command I run. The
>> mapper always fails on one record, which fails the whole job. The input
>> file has 3 lines, 1 of which is bad.
>>
>> hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
>>   -D mapred.job.name=SkipTest \
>>   -Dmapreduce.task.skip.start.attempts=1 \
>>   -Dmapreduce.map.skip.maxrecords=1 \
>>   -Dmapreduce.reduce.skip.maxgroups=1 \
>>   -Dmapreduce.map.skip.proc.count.autoincr=false \
>>   -Dmapreduce.reduce.skip.proc.count.autoincr=false \
>>   -D mapred.reduce.tasks=1 \
>>   -D mapred.map.tasks=1 \
>>   -files /home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py \
>>   -input /user/hadoop/data/test1 \
>>   -output /user/hadoop/data/output-test-5 \
>>   -mapper "python wordcount-mapper-t1.py" \
>>   -reducer "python wordcount-reducer-t1.py"
>>
>>
>> I was wondering if skipping of records is supported when MapReduce is used
>> in streaming mode?
>>
>> Thanks in advance.
>> PW
>>
>>
>