hadoop-mapreduce-dev mailing list archives

From Daniel Templeton <dan...@cloudera.com>
Subject Re: Skip bad records when streaming supported?
Date Thu, 13 Apr 2017 22:28:37 GMT
To quote the docs:

---
This feature can be used when map/reduce tasks crashes deterministically 
on certain input. This happens due to bugs in the map/reduce function. 
The usual course would be to fix these bugs. But sometimes this is not 
possible; perhaps the bug is in third party libraries for which the 
source code is not available. Due to this, the task never reaches to 
completion even with multiple attempts and complete data for that task 
is lost.

With this feature, only a small portion of data is lost surrounding the 
bad record, which may be acceptable for some user applications. see 
setMapperMaxSkipRecords(Configuration, long)
---

Basically, it's a heavy-handed approach that you should only use as a 
last resort.
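For completeness, here is a minimal sketch of the defensive-mapper alternative as a
self-contained streaming word-count mapper. The function name, the logging message,
and the byte-oriented record format are illustrative assumptions, not from the
original thread; the point is only that a record the mapper cannot decode is skipped
inside the script rather than crashing the task:

```python
import sys

def map_words(lines):
    """Word-count map step that skips records it cannot process."""
    pairs = []
    for line in lines:
        try:
            # A record that is not valid UTF-8 raises here instead of
            # killing the whole task attempt.
            for word in line.decode("utf-8").split():
                pairs.append((word, 1))
        except Exception:
            # Bad record: note it on stderr (visible in task logs) and move on.
            sys.stderr.write("skipping bad record\n")
    return pairs

if __name__ == "__main__":
    # Streaming mappers read raw lines on stdin and emit "key<TAB>value" pairs.
    for word, n in map_words(sys.stdin.buffer):
        print("%s\t%d" % (word, n))
```

With this in place, a malformed line in the input costs exactly that one record,
and no skip-mode configuration is needed on the job.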

Daniel


On 4/13/17 3:24 PM, Pillis W wrote:
> Thanks Daniel.
>
> Please correct me if I have understood this incorrectly, but according 
> to the documentation at 
> http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Skipping_Bad_Records
> it seems the sole purpose of this functionality is to tolerate 
> unknown failures/exceptions in mappers/reducers. If I were able to 
> catch all failures myself, I would not need this feature at all - is 
> that not true?
>
> If I have understood it incorrectly, when would one use the feature to 
> skip bad records?
>
> Regards,
> PW
>
>
>
>
> On Thu, Apr 13, 2017 at 2:49 PM, Daniel Templeton <daniel@cloudera.com 
> <mailto:daniel@cloudera.com>> wrote:
>
>     You have to modify wordcount-mapper-t1.py to just ignore the bad
>     line.  In the worst case, you should be able to do something like:
>
>     import sys
>
>     for line in sys.stdin:
>       try:
>         # Insert processing code here
>         pass
>       except Exception:
>         # Error processing record, ignore it
>         pass
>
>     Daniel
>
>
>     On 4/13/17 1:33 PM, Pillis W wrote:
>
>         Hello,
>         I am using 'hadoop-streaming.jar' to do a simple word count,
>         and want to
>         skip records that fail execution. Below is the actual command
>         I run, and
>         the mapper always fails on one record, and hence fails the
>         job. The input
>         file is 3 lines with 1 bad line.
>
>         hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
>           -D mapred.job.name=SkipTest \
>           -D mapreduce.task.skip.start.attempts=1 \
>           -D mapreduce.map.skip.maxrecords=1 \
>           -D mapreduce.reduce.skip.maxgroups=1 \
>           -D mapreduce.map.skip.proc.count.autoincr=false \
>           -D mapreduce.reduce.skip.proc.count.autoincr=false \
>           -D mapred.reduce.tasks=1 \
>           -D mapred.map.tasks=1 \
>           -files /home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py \
>           -input /user/hadoop/data/test1 \
>           -output /user/hadoop/data/output-test-5 \
>           -mapper "python wordcount-mapper-t1.py" \
>           -reducer "python wordcount-reducer-t1.py"
>
>
>         I was wondering if skipping of records is supported when
>         MapReduce is used
>         in streaming mode?
>
>         Thanks in advance.
>         PW
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-dev-help@hadoop.apache.org

