hadoop-common-issues mailing list archives

From "Aki Tanaka (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small
Date Wed, 07 Feb 2018 01:13:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354819#comment-16354819 ]

Aki Tanaka commented on HADOOP-15206:

Thank you very much for the comments!

{quote}This doesn't just drop the first bzip2 block, it drops the entire split. This goes back to
my previous comment about the code assuming splits that start between bytes 1-4 are always
tiny. Splits do not have to be equally sized, so theoretically there could be just two splits
where the first split is a two-byte split starting at offset 0 and the other split is the
rest of the file.{quote}
Thank you for explaining the details. I understand the problem.
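
To make the two-split scenario concrete, here is a toy model of it. It is only a sketch: the class, the `Split` type, and the "skip any split that starts in bytes 1-4" rule are hypothetical stand-ins for the assumption being criticized, not Hadoop's actual code, and the 1000-byte file size is an arbitrary example.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model only; none of these names exist in Hadoop. */
final class SplitSkipModel {
    static final int BZIP2_HEADER_LEN = 4; // "BZh" plus one block-size digit

    /** A byte range [start, start + length), like the "start+length" suffix in the test logs. */
    static final class Split {
        final long start;
        final long length;
        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Hypothetical faulty rule: treat any split starting at bytes 1-4 as tiny and skip it. */
    static boolean faultyRuleSkips(Split s) {
        return s.start >= 1 && s.start <= BZIP2_HEADER_LEN;
    }

    /** Total bytes covered by splits that the faulty rule would discard. */
    static long bytesDiscarded(List<Split> splits) {
        long total = 0;
        for (Split s : splits) {
            if (faultyRuleSkips(s)) {
                total += s.length;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // A 1000-byte file cut into a two-byte split at offset 0 and "the rest of the file".
        List<Split> splits = Arrays.asList(new Split(0, 2), new Split(2, 998));
        System.out.println("bytes discarded: " + bytesDiscarded(splits));
    }
}
```

Under this model the second split starts at offset 2, so the rule discards it even though it covers almost the whole file.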

{quote}The logic regarding the header seems backwards. If the header is stripped then that
means there was a header present, yet the logic is only adding up bytes for a header length
if it was not stripped which is the case when the header is not there.{quote}
That's right... Thank you for pointing this out.
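
The corrected direction of that logic could look roughly like the sketch below. The flag names isHeaderStripped and isSubHeaderStripped come from this discussion; the class name, the method, and the 2 + 2 byte constants (for "BZ" and for "h" plus the block-size digit) are assumptions made for illustration and are not taken from the patch.

```java
/** Illustrative sketch only; not the code in HADOOP-15206.*.patch. */
final class HeaderLengthSketch {
    static final int HEADER_LEN = 2;     // assumed: the "BZ" magic
    static final int SUB_HEADER_LEN = 2; // assumed: "h" plus the block-size digit

    /**
     * Counts header bytes only for the parts that were actually present and stripped.
     * A part that was not stripped was not there, so it contributes nothing.
     */
    static int strippedHeaderBytes(boolean isHeaderStripped, boolean isSubHeaderStripped) {
        int bytes = 0;
        if (isHeaderStripped) {
            bytes += HEADER_LEN;
        }
        if (isSubHeaderStripped) {
            bytes += SUB_HEADER_LEN;
        }
        return bytes;
    }
}
```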
After some testing, I noticed the following two points.

1. When reading starts from position 1-3 (inside the bzip2 header), isHeaderStripped and
isSubHeaderStripped are always false. This is because the current readStreamHeader() works
only when the start position is 0.

2. I set the InputStream's start position to one byte beyond the start of the first bzip2
block (header_len + 1), but the duplicated-records issue still occurred. When I set it to
header_len + 5 (9), the problem is avoided.
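
Point 1 can be illustrated with a simplified stand-in for the header check. This is not readStreamHeader() itself; the class and method names are hypothetical, and the only thing modeled is that a check for the "BZ" magic at the current position can succeed only when the stream starts at offset 0.

```java
import java.nio.charset.StandardCharsets;

/** Simplified stand-in for a header check; not Hadoop's readStreamHeader(). */
final class HeaderAtOffsetSketch {
    /** Returns true if the bytes visible from 'start' begin with the "BZ" magic. */
    static boolean seesMagic(byte[] file, int start) {
        return file.length - start >= 2
                && file[start] == 'B'
                && file[start + 1] == 'Z';
    }

    public static void main(String[] args) {
        byte[] header = "BZh9".getBytes(StandardCharsets.US_ASCII);
        for (int start = 0; start <= 3; start++) {
            System.out.println("start=" + start + " seesMagic=" + seesMagic(header, start));
        }
    }
}
```

A stream opened at offset 1, 2, or 3 sees "Zh9", "h9", or "9" first, so a check of this kind never reports a stripped header there.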

From inspecting the test bz2 file in a binary editor, the first bz2 block marker starts at
position 4 (right after the bz2 header).
I am still trying to understand why header_len + 5 is needed.
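
The binary-editor observation can be reproduced with a small probe. The sketch below assumes the standard bzip2 layout: a 4-byte stream header ("BZh" plus a digit) followed by the 6-byte block magic 0x314159265359. The class and method names are hypothetical, and the sample bytes are hand-built rather than read from test.bz2.

```java
/** Hypothetical probe for the first block marker; not part of Hadoop. */
final class FirstBlockMarkerProbe {
    // Block header magic from the bzip2 format: 0x314159265359.
    static final byte[] BLOCK_MAGIC = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};

    /** Returns the byte offset of the first byte-aligned block magic, or -1 if none is found. */
    static int firstByteAlignedMarker(byte[] data) {
        for (int i = 0; i + BLOCK_MAGIC.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < BLOCK_MAGIC.length; j++) {
                if (data[i + j] != BLOCK_MAGIC[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hand-built prefix: stream header "BZh9", block magic, then two filler bytes.
        byte[] prefix = {'B', 'Z', 'h', '9', 0x31, 0x41, 0x59, 0x26, 0x53, 0x59, 0, 0};
        int start = firstByteAlignedMarker(prefix);
        int lastByte = start + BLOCK_MAGIC.length - 1;
        System.out.println("marker occupies offsets " + start + ".." + lastByte);
    }
}
```

With this layout the marker occupies offsets 4 through 9, so header_len + 5 (9) is the marker's last byte. That may be related to the behavior above, but the sketch does not establish it. Only the first marker is guaranteed to be byte-aligned; later block markers can start at any bit position.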

> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>                 Key: HADOOP-15206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15206
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.0
>            Reporter: Aki Tanaka
>            Priority: Major
>         Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, HADOOP-15206.002.patch
> BZip2 can drop and duplicate records when the input split size is small. I confirmed that
> this issue happens when the input split size is between 1 byte and 4 bytes.
> I am seeing the following two problematic behaviors.
> 1. Dropped record:
> BZip2 skips the first record in the input file when the input split size is small.
> With the split size set to 3, I tested loading 100 records (0, 1, 2..99):
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(317)) - splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
> {code}
> The input format read only 99 records instead of 100.
> 2. Duplicated record:
> Two input splits contain the same BZip2 records when the input split size is small.
> With the split size set to 1, I tested loading 100 records (0, 1, 2..99):
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 at position 8
> {code}
> I experienced this error when executing a Spark (SparkSQL) job under the following conditions:
> * The input files are small (around 1KB)
> * The Hadoop cluster has many slave nodes (able to launch many executor tasks)

This message was sent by Atlassian JIRA
