hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small
Date Tue, 13 Feb 2018 00:03:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361615#comment-16361615 ]

Jason Lowe commented on HADOOP-15206:
-------------------------------------

Thanks for updating the patch!

bq. In the current implementation, the "BZ" header is read only when the read mode is CONTINUOUS.
Do you think we should keep this?

Yes, because it's not important to read the header when the codec is in BLOCK mode.  IIUC,
the main difference between CONTINUOUS and BLOCK mode is that BLOCK mode is used when
processing splits, while CONTINUOUS mode is used when we're simply decompressing the data
in one big chunk (i.e.: no splits).  BLOCK mode will always scan for the start of a bz2
block, so it will automatically skip a bz2 file header while searching for the start of the
first bz2 block from the specified start offset.
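
For reference, here's a minimal sketch (not the patch; the helper name is made up, but the Hadoop types and the createInputStream signature are real) of how the two read modes are requested through the SplittableCompressionCodec API:
{code:java}
// Minimal sketch, not the patch: split processing asks for BYBLOCK mode,
// while whole-file decompression uses CONTINUOUS. openSplit() is a made-up
// helper name used only for illustration.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.SplitCompressionInputStream;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class ReadModeSketch {
  public static SplitCompressionInputStream openSplit(
      Configuration conf, Path file, long start, long end) throws IOException {
    BZip2Codec codec = new BZip2Codec();
    codec.setConf(conf);
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream in = fs.open(file);
    // BYBLOCK scans forward from `start` for the next bz2 block marker, so
    // any file header among the skipped bytes is irrelevant. CONTINUOUS
    // instead decompresses one stream from the beginning, which is why only
    // that mode needs to consume the "BZ" header itself.
    return codec.createInputStream(in, codec.createDecompressor(),
        start, end, SplittableCompressionCodec.READ_MODE.BYBLOCK);
  }
}
{code}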

Given that the splittable codec is always scanning for a block start and doesn't really care
what bytes are being skipped, I'm now thinking we can go back to a much simpler implementation.
The code can check whether we're in BLOCK mode to know whether we are processing splits.
If we are in BLOCK mode, we avoid advertising the byte position when the start offset is zero,
just as in the previous patches.  In BLOCK mode we should also skip to file offset HEADER_LEN
+ SUB_HEADER_LEN + 1 if the start position is >= 0 and < HEADER_LEN + SUB_HEADER_LEN.  That
will put us one byte past the start of the first bz2 block, and BLOCK mode will automatically
scan forward to the next block.  This proposal is very similar to what was implemented in patch
003; I think we just need to make it perform the position adjustment only when we're in BLOCK
mode.
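
To make that concrete, here's a rough sketch of the adjustment (the method name is illustrative, not the actual patch; HEADER_LEN and SUB_HEADER_LEN mirror the 2-byte "BZ" and "h9" header constants in BZip2Codec):
{code:java}
// Rough sketch of the proposed start-offset adjustment; adjustedStart() is
// a made-up name, not the actual patch.
import org.apache.hadoop.io.compress.SplittableCompressionCodec.READ_MODE;

public class SplitStartSketch {
  private static final long HEADER_LEN = 2;      // "BZ"
  private static final long SUB_HEADER_LEN = 2;  // "h9"

  static long adjustedStart(long start, READ_MODE readMode) {
    // Only adjust when processing splits, i.e. in BYBLOCK (BLOCK) mode.
    if (readMode == READ_MODE.BYBLOCK
        && start >= 0 && start < HEADER_LEN + SUB_HEADER_LEN) {
      // Land one byte past where the first bz2 block begins; BYBLOCK mode
      // then scans forward to the next block marker on its own.
      return HEADER_LEN + SUB_HEADER_LEN + 1;
    }
    return start;
  }
}
{code}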

> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>
>                 Key: HADOOP-15206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15206
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.0
>            Reporter: Aki Tanaka
>            Priority: Major
>         Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, HADOOP-15206.002.patch,
HADOOP-15206.003.patch, HADOOP-15206.004.patch
>
>
> BZip2 can drop and duplicate records when the input split size is small. I confirmed that
this issue happens when the input split size is between 1 byte and 4 bytes.
> I am seeing the following two problem behaviors.
>  
> 1. Dropped record:
> BZip2 skips the first record in the input file when the input split size is small
>  
> Set the split size to 3 and tested loading 100 records (0, 1, 2, ..., 99):
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(317))
- splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
count=99{code}
> > The input format read only 99 records, not 100
>  
> 2. Duplicated record:
> Two input splits contain the same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested loading 100 records (0, 1, 2, ..., 99):
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(318))
- splits[3]=file /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat (TestTextInputFormat.java:verifyPartitions(308))
- conflict with 1 in split 4 at position 8
> {code}
>  
> I experienced this error when I executed a Spark (SparkSQL) job under the following conditions:
> * The input files are small (around 1 KB)
> * The Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  
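
For anyone trying to reproduce this, here is an illustrative way to force tiny splits over a small bz2 file (the input path and split size are hypothetical; the attached test patch is the authoritative reproduction):
{code:java}
// Illustrative repro sketch, not the attached test patch: cap the split
// size so a ~1 KB bz2 file is carved into many tiny splits, the condition
// under which records are dropped or duplicated.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class TinySplitRepro {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());
    FileInputFormat.addInputPath(job, new Path("test.bz2")); // hypothetical
    FileInputFormat.setMaxInputSplitSize(job, 3L);           // 3-byte splits
    // Reading every split with a TextInputFormat record reader and counting
    // the records should total the number of lines in the file.
    TextInputFormat format = new TextInputFormat();
    System.out.println("number of splits: " + format.getSplits(job).size());
  }
}
{code}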



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
