hadoop-common-issues mailing list archives

From "Aki Tanaka (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small
Date Wed, 07 Feb 2018 01:13:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354819#comment-16354819 ]

Aki Tanaka commented on HADOOP-15206:
-------------------------------------

Thank you very much for the comments!
{quote}
This doesn't just drop the first bzip2 block, it drops the entire split. This goes back to
my previous comment about the code assuming splits that start between bytes 1-4 are always
tiny. Splits do not have to be equally sized, so theoretically there could be just two splits
where the first split is a two-byte split starting at offset 0 and the other split is the
rest of the file.
{quote}
Thank you for explaining the details. I understand the problem.
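
To make the scenario in the quote concrete, here is a minimal sketch in Python (not Hadoop code). The names HEADER_LEN, Split, and starts_inside_header are hypothetical, the file size is made up, and the 4-byte header length is an assumption based on a "BZh" + level stream header. It only illustrates that a split can start inside the header and still cover almost the whole file, so the start offset alone says nothing about the split's size.

```python
from dataclasses import dataclass

HEADER_LEN = 4  # assumption: "BZh" plus one block-size digit


@dataclass(frozen=True)
class Split:
    start: int   # byte offset where the split begins
    length: int  # number of bytes in the split


def starts_inside_header(split: Split) -> bool:
    """True when the split begins after offset 0 but before the first block."""
    return 0 < split.start < HEADER_LEN


file_len = 1024  # hypothetical file size
# Two unequal splits, as in the quoted comment: a two-byte split at offset 0
# and a second split covering the rest of the file.
splits = [Split(0, 2), Split(2, file_len - 2)]

for s in splits:
    print(s, "starts inside header:", starts_inside_header(s))
```

Here the second split starts at offset 2 but is 1022 bytes long, so discarding it based on its start offset would discard nearly all of the data.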

 
{quote}The logic regarding the header seems backwards. If the header is stripped then that
means there was a header present, yet the logic is only adding up bytes for a header length
if it was not stripped which is the case when the header is not there.
{quote}
That's right... Thank you for pointing this out.
After some testing, I noticed the following two points.

1. When reading starts at position 1-3 (inside the bzip2 header), isHeaderStripped and isSubHeaderStripped
are always false. This is because the current readStreamHeader() works only when the start
position is 0.

2. I set the InputStream's start position to one byte beyond the start of the first bzip2 block
(header_len + 1), but the duplicated-records issue still occurred. When I set it to header_len + 5 (9),
the problem is avoided.

As far as I can tell from looking at the test bz2 file in a binary editor, the first bz2 block
marker starts at position 4 (right after the bz2 header).
I am still trying to understand why we need to set header_len + 5.
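
For reference, the byte layout described above can be reproduced outside Hadoop. This is a minimal sketch using Python's standard bz2 module, not the Hadoop reader; the payload is hypothetical test data modeled on the 100 records (0, 1, 2..99) in the issue description, and it is not the actual test.bz2 file. It locates the 48-bit block-start marker (0x314159265359) and prints the offsets it occupies next to the two start positions tried above.

```python
import bz2

BLOCK_MAGIC = bytes.fromhex("314159265359")  # 48-bit bzip2 block-start marker

# Hypothetical test data: 100 records, one per line (0, 1, ..., 99).
payload = "".join(f"{i}\n" for i in range(100)).encode("ascii")
compressed = bz2.compress(payload, compresslevel=9)

header = compressed[:4]          # "BZh" plus the block-size digit
header_len = len(header)
marker_start = compressed.find(BLOCK_MAGIC)
marker_end = marker_start + len(BLOCK_MAGIC) - 1  # offset of the marker's last byte

print("stream header:", header)
print("first block marker occupies offsets", marker_start, "to", marker_end)
print("header_len + 1 =", header_len + 1, "| header_len + 5 =", header_len + 5)
```

In this layout the marker spans offsets 4 to 9, so header_len + 1 (5) falls inside the marker while header_len + 5 (9) is its last byte. Whether that is the reason header_len + 5 avoids the duplicates is not established here; it would need to be checked against how the Hadoop reader scans for block markers.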

> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>
>                 Key: HADOOP-15206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15206
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.0
>            Reporter: Aki Tanaka
>            Priority: Major
>         Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, HADOOP-15206.002.patch
>
>
> BZip2 can drop and duplicate records when the input split size is small. I confirmed that
> this issue happens when the input split size is between 1 byte and 4 bytes.
> I am seeing the following two problems.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is small
>  
> With the split size set to 3, I tested loading 100 records (0, 1, 2..99):
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(317)) - splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3 count=99
> {code}
> The input format read only 99 records instead of 100.
>  
> 2. Duplicate Record:
> Two input splits contain the same BZip2 records when the input split size is small
>  
> With the split size set to 1, I tested loading 100 records (0, 1, 2..99):
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1 count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 at position 8
> {code}
>  
> I experienced this error when running a Spark (SparkSQL) job under the following conditions:
> * The input files are small (around 1KB)
> * The Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


