hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small
Date Fri, 02 Feb 2018 23:08:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351044#comment-16351044 ]

Jason Lowe commented on HADOOP-15206:

I found a bit of time to look into this, so I'm dumping my notes here.  I'm not sure when
I'll get more time to work on it, so if someone feels brave enough to step in, feel free.

Here's how I believe records get dropped with very small split sizes:
 # There's only one bz2 block in the file
 # The split size is smaller than 4 bytes
 # First split starts to read the data. It consumes the 'BZh9' magic header, then updates the
reported byte position of the stream to 4
 # At this point the first split reader is beyond the end of the split before it has read
a single record, so it ends up returning with no records.
 # The second split starts in the middle of the 'BZh9' magic header and scans forward to find
the start of a bz2 block and starts processing the split
 # Since this is not the first split, it throws away the first record on the assumption that
the previous split is responsible for it
 # The second split reader proceeds to consume all remaining data, since the byte position is
not updated until the next bz2 block and there's only one block
 # The end result is that the first record is lost, since the first split never consumed it.
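The steps above can be sketched as a toy model. This is not the actual Hadoop reader code; all names here (the demo class, `readSplit`, the `RECORDS` array) are hypothetical, and the two branches just mirror the position-accounting behavior described in the list:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the split-reader behavior described above.
// Hypothetical names; not the real Hadoop BZip2 reader.
public class Bzip2SplitDropDemo {
    static final int MAGIC_LEN = 4;               // length of the 'BZh9' header
    static final String[] RECORDS = {"r0", "r1", "r2"};

    // Records returned by a reader assigned the split [start, start + len).
    static List<String> readSplit(int start, int len) {
        List<String> out = new ArrayList<>();
        if (start == 0) {
            // First split: consuming the magic header immediately advances
            // the reported stream position to 4...
            int reportedPos = MAGIC_LEN;
            // ...which is already past the split end for a tiny split, so
            // the reader returns before emitting a single record.
            if (reportedPos >= start + len) {
                return out;
            }
        }
        if (start > 0 && start < MAGIC_LEN) {
            // A later split scans forward, finds the single bz2 block, and
            // discards the first record (assumed to belong to the previous
            // split), then consumes everything else in the block.
            for (int i = 1; i < RECORDS.length; i++) {
                out.add(RECORDS[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> all = new ArrayList<>();
        int splitSize = 3;                        // split size smaller than the magic header
        for (int start = 0; start < 6; start += splitSize) {
            all.addAll(readSplit(start, splitSize));
        }
        System.out.println(all);                  // first record r0 is missing
    }
}
```

Running this with a split size of 3 yields every record except the first, matching the drop described in step 8.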

I think we can fix this scenario by not advertising a new byte position after reading the
'BZh9' header and only updating the byte position when we read the bz2 block header following
the current bz2 block.
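A minimal sketch of that position-reporting policy, under the assumption stated above (the method name and constants are hypothetical, not from the Hadoop source):

```java
// Sketch of the proposed fix: only advance the advertised byte position
// when crossing a bz2 *block* header, never after the 'BZh9' stream header.
public class PositionPolicyDemo {
    // Position advertised after consuming the stream magic, under the
    // current behavior vs. the proposed policy.
    static long advertisedAfterMagic(boolean proposedFix) {
        long rawPos = 4;                  // bytes actually consumed ('BZh9')
        // Proposed: keep advertising 0 until the next bz2 block header,
        // so a tiny first split is not abandoned before reading a record.
        return proposedFix ? 0 : rawPos;
    }

    public static void main(String[] args) {
        long splitEnd = 3;                // tiny first split [0, 3)
        System.out.println("current behavior, past split end: "
            + (advertisedAfterMagic(false) >= splitEnd));   // reader stops early
        System.out.println("proposed policy, past split end: "
            + (advertisedAfterMagic(true) >= splitEnd));    // reader keeps going
    }
}
```

Under the current behavior the advertised position (4) exceeds the split end (3) before any record is read; under the proposed policy it stays at 0 until the next block header, so the first split keeps reading.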

I didn't get as much time to look into the duplicated record scenario, but I suspect multiple
splits end up discovering the beginning of the bz2 block and think it is their block to consume.
I'm not sure yet how we can easily distinguish which split is the one, true split responsible
for consuming the bz2 block, given that we're hiding the true byte offset from the upper layers
most of the time.

> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>                 Key: HADOOP-15206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15206
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.0
>            Reporter: Aki Tanaka
>            Priority: Major
>         Attachments: HADOOP-15206-test.patch
> BZip2 can drop and duplicate records when the input split size is small. I confirmed that
this issue happens when the input split size is between 1 byte and 4 bytes.
> I am seeing the following 2 problem behaviors.
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is small.
> I set the split size to 3 and tested loading 100 records (0, 1, 2, ..., 99):
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(317))
- splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
> {code}
> The input format read only 99 records, not 100.
> 2. Duplicate Record:
> Two input splits contain the same BZip2 records when the input split size is small.
> I set the split size to 1 and tested loading 100 records (0, 1, 2, ..., 99):
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(318))
- splits[3]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(308))
- conflict with 1 in split 4 at position 8
> {code}
> I experienced this error when I executed a Spark (SparkSQL) job under the following conditions:
> * The input files are small (around 1 KB)
> * The Hadoop cluster has many slave nodes (able to launch many executor tasks)

This message was sent by Atlassian JIRA

