hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small
Date Tue, 06 Feb 2018 21:58:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354606#comment-16354606 ]

Jason Lowe commented on HADOOP-15206:

Thanks for updating the patch!
{quote}Because 4 is a position of the first bz2 block marker, and an input stream will start
reading the first bz2 block if the start position of the input stream is 4.{quote}
Ah, right. Thanks for the explanation.
{quote}So, if the input stream tries to read from position 1-4, it will drop the first BZ2
block even though the block marker position is 4.{quote}
This doesn't just drop the first bzip2 block, it drops the entire split. This goes back to
my previous comment about the code assuming splits that start between bytes 1-4 are always
tiny. Splits do not have to be equally sized, so theoretically there could be just two splits
where the first split is a two-byte split starting at offset 0 and the other split is the
rest of the file. I believe this change would cause all records to be dropped in that scenario.
To fix that I think we only need to report a position that is one byte beyond the start of
the first bzip2 block rather than one byte beyond the end of the entire split (i.e., header_len + 1
rather than end + 1).
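
To make the two-split scenario concrete, here is a minimal sketch in Java. It is a simplified model, not the patch or the codec itself: the class and method names are hypothetical, the file length of 1000 is made up, and it assumes a reader keeps consuming records only while the reported position is at or before the end of its split.

```java
// Simplified model of the scenario above. All names here are hypothetical;
// this is not code from the patch or from BZip2Codec.
final class SplitPositionSketch {
    // Assumption: the first bzip2 block marker sits at offset 4,
    // as stated in the quoted explanation.
    static final long HEADER_LEN = 4;

    // Assumption: a reader consumes records only while the position
    // reported by the stream is <= the end of its split.
    static boolean readsAnyRecord(long reportedPos, long splitEnd) {
        return reportedPos <= splitEnd;
    }

    // Position reported for a stream starting at offsets 1-4, as described
    // for the current patch: one byte beyond the end of the split.
    static long positionPerPatch(long splitEnd) {
        return splitEnd + 1;
    }

    // Suggested alternative: one byte beyond the start of the first block.
    static long positionSuggested() {
        return HEADER_LEN + 1;
    }

    public static void main(String[] args) {
        long fileLen = 1000; // hypothetical file length
        long bigSplitEnd = fileLen; // second split: offset 2 to end of file

        System.out.println("end + 1, split 2..1000 reads records: "
            + readsAnyRecord(positionPerPatch(bigSplitEnd), bigSplitEnd));
        System.out.println("header_len + 1, split 2..1000 reads records: "
            + readsAnyRecord(positionSuggested(), bigSplitEnd));
        // A genuinely tiny split that ends before the first block still reads nothing.
        System.out.println("header_len + 1, split 2..3 reads records: "
            + readsAnyRecord(positionSuggested(), 3));
    }
}
```

Under these assumptions, reporting end + 1 makes the large second split skip everything, while header_len + 1 lets it read the first block and still keeps a split that ends inside the header empty.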

The logic regarding the header seems backwards. If the header is stripped then that means
there was a header present, yet the logic only adds bytes for the header length if it
was *not* stripped, which is the case when the header is not there. I'm wondering how it's
working, since I think the header is always there in the unit tests.
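
A small sketch of the two readings, in case it helps. This paraphrases the logic as I described it above rather than quoting the patch, and the names (`headerStripped`, `headerLen`, the method names) are hypothetical.

```java
// Paraphrase of the header accounting discussed above.
// Names are hypothetical and this is not the patch text.
final class HeaderLengthSketch {
    // Logic as reviewed: header bytes are counted only when the header
    // was NOT stripped, i.e. when no header was found.
    static long headerBytesAsReviewed(boolean headerStripped, long headerLen) {
        return headerStripped ? 0 : headerLen;
    }

    // Logic I would expect: a stripped header means a header was present,
    // so its bytes should be counted.
    static long headerBytesAsExpected(boolean headerStripped, long headerLen) {
        return headerStripped ? headerLen : 0;
    }

    public static void main(String[] args) {
        long headerLen = 4; // assumption: first block marker at offset 4
        System.out.println("reviewed, header stripped: "
            + headerBytesAsReviewed(true, headerLen));
        System.out.println("expected, header stripped: "
            + headerBytesAsExpected(true, headerLen));
    }
}
```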

> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>                 Key: HADOOP-15206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15206
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.0
>            Reporter: Aki Tanaka
>            Priority: Major
>         Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, HADOOP-15206.002.patch
> BZip2 can drop and duplicate records when the input split size is small. I confirmed that
> this issue happens when the input split size is between 1 byte and 4 bytes.
> I am seeing the following two problems.
> 1. Dropped record:
> BZip2 skips the first record in the input file when the input split size is small.
> I set the split size to 3 and tried to load 100 records (0, 1, 2..99).
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(317)) - splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
> {code}
> The input format read only 99 records instead of 100.
> 2. Duplicated record:
> Two input splits return the same BZip2 records when the input split size is small.
> I set the split size to 1 and tried to load 100 records (0, 1, 2..99).
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 at position 8
> {code}
> I experienced this error when I executed a Spark (SparkSQL) job under the following conditions:
> * The input files are small (around 1KB)
> * The Hadoop cluster has many slave nodes (able to launch many executor tasks)
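
For reference, the split labels in the logs above (3+3 for splits[1] with split size 3, and 3+1 for splits[3] with split size 1) can be reproduced with a rough sketch like the following. It assumes plain fixed-size splitting and ignores any slop the real input format applies to the last split; the class name, method names, and the 12-byte file length are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of fixed-size splitting, to show which splits start inside
// the first four bytes. Names and the file length are hypothetical.
final class SmallSplitSketch {
    // Returns "start+length" labels, the form that appears in the logs.
    static List<String> splitLabels(long fileLen, long splitSize) {
        List<String> labels = new ArrayList<>();
        for (long start = 0; start < fileLen; start += splitSize) {
            long len = Math.min(splitSize, fileLen - start);
            labels.add(start + "+" + len);
        }
        return labels;
    }

    // Offsets 1-4 are the range reported as problematic in this issue.
    static boolean startsInProblemRange(long start) {
        return start >= 1 && start <= 4;
    }

    public static void main(String[] args) {
        long fileLen = 12; // hypothetical; only the first few splits matter here
        for (long splitSize : new long[] {3, 1}) {
            List<String> labels = splitLabels(fileLen, splitSize);
            for (int i = 0; i < labels.size(); i++) {
                long start = i * splitSize;
                if (startsInProblemRange(start)) {
                    System.out.println("split size " + splitSize
                        + ": splits[" + i + "]=" + labels.get(i));
                }
            }
        }
    }
}
```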

This message was sent by Atlassian JIRA

