hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
Date Tue, 19 Nov 2013 17:45:37 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826729#comment-13826729 ]

Jason Lowe commented on HADOOP-9867:
------------------------------------

Ran across this JIRA while discussing the intricacies of HADOOP-9622.  There's a relatively
straightforward test case that demonstrates the issue.  With the following plaintext input

{code:title=customdeliminput.txt}
abcxxx
defxxx
ghixxx
jklxxx
mnoxxx
pqrxxx
stuxxx
vw xxx
xyzxxx
{code}

run a wordcount job like this:

{noformat}
hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar wordcount \
  -Dmapreduce.input.fileinputformat.split.maxsize=33 \
  -Dtextinputformat.record.delimiter=xxx \
  customdeliminput.txt wcout
{noformat}

and we can see that one of the records ({{pqr}}) was dropped due to incorrect split processing:

{noformat}
$ hadoop fs -cat wcout/part-r-00000               
abc	1
def	1
ghi	1
jkl	1
mno	1
stu	1
vw	1
xyz	1
{noformat}
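
For reference, the arithmetic behind the drop (my own walkthrough, not anything from the job logs): each input line above is 7 bytes, so the file is 63 bytes, and the 33-byte max split size produces splits [0,33) and [33,63).  Offset 33 lands on the last byte of the {{xxx}} delimiter that follows {{mno}}, so the second reader's scan for its first full delimiter runs past {{pqr}}.  Here's a self-contained toy simulation of that skip logic (a sketch of the behavior, not the actual LineRecordReader code; class and method names are mine):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Toy model of the current split handling for a multibyte custom delimiter.
// A reader for a non-first split discards everything up to and including
// the first full delimiter it can find, then emits records while its
// position is still at or before the end of the split.
public class SplitDropDemo {
  static final String DELIM = "xxx";

  public static void main(String[] args) {
    // Rebuild the 63-byte sample input: nine 7-byte lines.
    StringBuilder sb = new StringBuilder();
    for (String w : new String[] {"abc", "def", "ghi", "jkl", "mno",
                                  "pqr", "stu", "vw ", "xyz"}) {
      sb.append(w).append(DELIM).append('\n');
    }
    String data = sb.toString();
    // Mirrors mapreduce.input.fileinputformat.split.maxsize=33.
    int[][] splits = { {0, 33}, {33, data.length()} };
    for (int[] s : splits) {
      System.out.println("split [" + s[0] + "," + s[1] + "): "
          + readSplit(data, s[0], s[1]));
    }
  }

  static List<String> readSplit(String data, int start, int end) {
    List<String> records = new ArrayList<>();
    int pos = start;
    if (start != 0) {
      // Toss the partial first record: scan for the first *full* delimiter.
      // When the boundary cuts the delimiter itself, the leftover "x" is
      // not recognized, so this scan throws away a whole record.
      int d = data.indexOf(DELIM, start);
      pos = (d < 0) ? data.length() : d + DELIM.length();
    }
    while (pos <= end && pos < data.length()) {
      int d = data.indexOf(DELIM, pos);
      int recEnd = (d < 0) ? data.length() : d;
      String rec = data.substring(pos, recEnd).trim();
      if (!rec.isEmpty()) records.add(rec);
      pos = (d < 0) ? data.length() : d + DELIM.length();
    }
    return records;
  }
}
// Prints: split [0,33): [abc, def, ghi, jkl, mno]
//         split [33,63): [stu, vw, xyz]       <-- pqr is gone
{code}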

I don't think rewinding the seek position by the delimiter length is correct in all cases.
 I believe that will lead to duplicate records rather than dropped records (e.g.: the split ends
exactly where a delimiter ends, and both splits end up processing the record after that delimiter).
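
To make that concrete with the same 63-byte input (my arithmetic, with a hypothetical split size): a boundary at byte 34 lands exactly at the end of the delimiter that follows {{mno}}.  A toy model of the rewind strategy, differing from the simulation above only in the seek-back before the initial scan, shows the next record being emitted by both splits:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Toy model of the "rewind by the delimiter length" idea (an illustration,
// not a real patch).  A reader for a non-first split seeks back
// DELIM.length() bytes before scanning for the first full delimiter.  If a
// split ends exactly where a delimiter ends, the record after that
// delimiter is emitted by *both* splits.
public class RewindDupDemo {
  static final String DELIM = "xxx";

  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (String w : new String[] {"abc", "def", "ghi", "jkl", "mno",
                                  "pqr", "stu", "vw ", "xyz"}) {
      sb.append(w).append(DELIM).append('\n');
    }
    String data = sb.toString();
    // Boundary at byte 34: exactly the end of the "xxx" after mno.
    int[][] splits = { {0, 34}, {34, data.length()} };
    for (int[] s : splits) {
      System.out.println("split [" + s[0] + "," + s[1] + "): "
          + readSplit(data, s[0], s[1]));
    }
  }

  static List<String> readSplit(String data, int start, int end) {
    List<String> records = new ArrayList<>();
    int pos = start;
    if (start != 0) {
      // Rewind by the delimiter length, then toss up to the first delimiter.
      int d = data.indexOf(DELIM, start - DELIM.length());
      pos = (d < 0) ? data.length() : d + DELIM.length();
    }
    while (pos <= end && pos < data.length()) {
      int d = data.indexOf(DELIM, pos);
      int recEnd = (d < 0) ? data.length() : d;
      String rec = data.substring(pos, recEnd).trim();
      if (!rec.isEmpty()) records.add(rec);
      pos = (d < 0) ? data.length() : d + DELIM.length();
    }
    return records;
  }
}
// Prints: split [0,34): [abc, def, ghi, jkl, mno, pqr]
//         split [34,63): [pqr, stu, vw, xyz]   <-- pqr counted twice
{code}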

Instead we can get correct behavior by treating any split boundary that falls in the middle
of a multibyte custom delimiter as if the delimiter ended exactly at the end of the split,
i.e.: the consumer of the prior split is responsible for processing the divided delimiter and
the subsequent record.  The consumer of the next split then tosses the first record up to the
first full delimiter as usual (i.e.: including the partial delimiter at the beginning of the
split) and proceeds to process any subsequent records.  That way we don't get any dropped
records or duplicate records.

I think one way of accomplishing this is to have the LineReader for multibyte custom delimiters
report the current position as the end of the record data *without* the delimiter bytes. 
Then any record that ends exactly at the end of the split, or whose delimiter straddles the
split boundary, will cause the prior split to consume the extra record as needed.
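
Here's the same toy model adjusted for that accounting (a sketch of the idea, not a patch against the real LineReader): the reported position after each record excludes the delimiter bytes, while the scan position still advances past them.  Both the mid-delimiter boundary (33) and the end-of-delimiter boundary (34) then yield every record exactly once:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed accounting: after each record the reader reports
// its position as the end of the record *data*, excluding the delimiter
// bytes.  A record whose delimiter ends at, or straddles, the split
// boundary still satisfies reported <= end, so the prior split consumes
// the following record, and the next split's usual "toss up to the first
// full delimiter" discards exactly that record.
public class ProposedFixDemo {
  static final String DELIM = "xxx";

  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (String w : new String[] {"abc", "def", "ghi", "jkl", "mno",
                                  "pqr", "stu", "vw ", "xyz"}) {
      sb.append(w).append(DELIM).append('\n');
    }
    String data = sb.toString();
    // Boundary 33 cuts the delimiter after mno; boundary 34 ends with it.
    for (int boundary : new int[] {33, 34}) {
      System.out.println("boundary " + boundary + ":");
      System.out.println("  " + readSplit(data, 0, boundary));
      System.out.println("  " + readSplit(data, boundary, data.length()));
    }
  }

  static List<String> readSplit(String data, int start, int end) {
    List<String> records = new ArrayList<>();
    int scan = start;      // where the next record's bytes begin
    int reported = start;  // position the reader reports upward
    if (start != 0) {
      // Unchanged: toss everything up to the first full delimiter.
      int d = data.indexOf(DELIM, start);
      scan = reported = (d < 0) ? data.length() : d + DELIM.length();
    }
    while (reported <= end && scan < data.length()) {
      int d = data.indexOf(DELIM, scan);
      int recEnd = (d < 0) ? data.length() : d;
      String rec = data.substring(scan, recEnd).trim();
      if (!rec.isEmpty()) records.add(rec);
      reported = recEnd;   // key change: delimiter bytes are excluded
      scan = (d < 0) ? data.length() : d + DELIM.length();
    }
    return records;
  }
}
// Prints, for both boundaries:
//   [abc, def, ghi, jkl, mno, pqr]
//   [stu, vw, xyz]
{code}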

> org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-9867
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9867
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.20.2
>         Environment: CDH3U2 Redhat linux 5.7
>            Reporter: Kris Geusebroek
>
> Having defined a record delimiter of multiple bytes in a new InputFileFormat sometimes has
> the effect of skipping records from the input.
> This happens when the input splits are split off just after a record separator. The starting
> point for the next split would be non-zero and skipFirstLine would be true. A seek into the
> file is done to start - 1 and the text until the first record delimiter is ignored (due to
> the presumption that this record was already handled by the previous map task). Since the
> record delimiter is multibyte, the seek only got the last byte of the delimiter into scope,
> and it's not recognized as a full delimiter. So the text is skipped until the next delimiter
> (ignoring a full record!).



--
This message was sent by Atlassian JIRA
(v6.1#6144)
