hadoop-common-issues mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14919) BZip2 drops records when reading data in splits
Date Mon, 30 Oct 2017 19:54:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225643#comment-16225643 ]

Chris Douglas commented on HADOOP-14919:
----------------------------------------

bq. We should not update the reported position when skipping just the 'BZh9' bytes and only
when we move from block mark to block mark. The existing behavior of skipping at file
offset 0 is benign, but I don't think we want/need to update the reported position when skipping
these extra bytes mid-stream.
+1. I saw this was skipped in the codec and wanted to be sure (if concatenation is supported)
that your fix worked in that case. But as you say, it's moot if it doesn't support concatenated
bz2 files.
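
To make the intent concrete, a minimal sketch of the idea described above (illustrative class and field names, not the actual CBZip2InputStream internals): the position exposed to the record reader advances only when the stream moves from one block mark to the next, never when the 'BZh9' header bytes are skipped mid-stream.
{code:java}
// Illustrative sketch only: PositionTrackingReader and its fields are made up
// to show the intent; this is not the Hadoop codec's actual code.
public class PositionTrackingReader {
  private long reportedPos;  // position exposed via getPos() to the record reader
  private long physicalPos;  // raw bytes consumed from the underlying stream

  // Skipping the 4-byte 'BZh9' stream header mid-stream consumes bytes but must
  // not advance the reported position; a record reader that compares getPos()
  // against its split end could otherwise stop early and drop records.
  public void skipStreamHeader(int headerLen) {
    physicalPos += headerLen;
  }

  // Moving from one block mark to the next is the only time the reported
  // position catches up with the physical position.
  public void advanceToNextBlockMark(long bytesToMark) {
    physicalPos += bytesToMark;
    reportedPos = physicalPos;
  }

  public long getPos() {
    return reportedPos;
  }

  public static void main(String[] args) {
    PositionTrackingReader r = new PositionTrackingReader();
    r.advanceToNextBlockMark(100);  // getPos() becomes 100
    r.skipStreamHeader(4);          // getPos() stays 100
    System.out.println(r.getPos());
  }
}
{code}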

bq. I had a little trouble following the example and knowing what was a record delimiter
Sorry. If split0 stopped at the end of the stream and split1 skipped to the next delimiter, then
the {{oooooo}} bytes would be skipped.

bq.  See TestLineRecordReader#testBzipWithMultibyteDelimiter
Thanks, I'd missed that.

+1 for committing this. Thanks for the detailed fix and followup, Jason.

> BZip2 drops records when reading data in splits
> -----------------------------------------------
>
>                 Key: HADOOP-14919
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14919
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>            Reporter: Aki Tanaka
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: 250000.bz2, HADOOP-14919-test.patch, HADOOP-14919.001.patch
>
>
> BZip2 can drop records when reading data in splits. This problem was already discussed
before in HADOOP-11445 and HADOOP-13270, but a corner case remains that causes records to
be lost.
>  
> I attached a unit test for this issue. You can reproduce the problem if you run the unit
test.
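>
> For anyone without the attached patch handy, a rough sketch of the kind of check the test performs (an illustration using the old mapred LineRecordReader API, not the attached test itself; 203426 is the split position discussed below):
> {code:java}
> // Illustrative reproduction sketch (not HADOOP-14919-test.patch): read the
> // attached 250000.bz2 as two splits, exactly as two map tasks would, and
> // count the records. With the bug present the total is below 250000.
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.FileSplit;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.LineRecordReader;
>
> public class Bzip2SplitRepro {
>   public static void main(String[] args) throws Exception {
>     JobConf conf = new JobConf();
>     Path file = new Path("250000.bz2");   // attached test file, 250000 records
>     long len = file.getFileSystem(conf).getFileStatus(file).getLen();
>     long splitPos = 203426L;              // split start that triggers the bug
>
>     FileSplit[] splits = {
>         new FileSplit(file, 0, splitPos, (String[]) null),
>         new FileSplit(file, splitPos, len - splitPos, (String[]) null)
>     };
>     long total = 0;
>     for (FileSplit split : splits) {
>       LineRecordReader reader = new LineRecordReader(conf, split);
>       LongWritable key = reader.createKey();
>       Text value = reader.createValue();
>       while (reader.next(key, value)) {
>         total++;
>       }
>       reader.close();
>     }
>     System.out.println("records read: " + total);
>   }
> }
> {code}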
>  
> First, this issue happens when the position of a newly created stream is equal to the start
of a split. Hadoop has some test cases for this (the blockEndingInCR.txt.bz2 file used by TestLineRecordReader#testBzip2SplitStartAtBlockMarker,
etc.). However, the issue I am reporting does not happen when we run these tests, because it
occurs only when the byte block at the start of the split contains both a block marker and
compressed data.
>  
> BZip2 block marker - 0x314159265359 (001100010100000101011001001001100101001101011001)
>  
> blockEndingInCR.txt.bz2 (Start of Split - 136504):
> {code:java}
> $ xxd -l 6 -g 1 -b -seek 136498 ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
> 0021532: 00110001 01000001 01011001 00100110 01010011 01011001  1AY&SY
> {code}
>  
> Test bz2 File (Start of Split - 203426):
> {code:java}
> $ xxd -l 7 -g 1 -b -seek 203419 250000.bz2
> 0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
> 0031aa1: 00101111                                               /
> {code}
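>
> As a quick, illustrative sanity check that the bits shown above for blockEndingInCR.txt.bz2 really are the block marker (bytes 31 41 59 26 53 59, "1AY&SY"), while the bits at 203426 in the test file are not:
> {code:java}
> public class BlockMarkerBits {
>   public static void main(String[] args) {
>     long blockMarker = 0x314159265359L;
>     String bits = Long.toBinaryString(blockMarker);
>     // Pad to 48 bits to match the xxd -b output above.
>     System.out.println(String.format("%48s", bits).replace(' ', '0'));
>     // Prints 001100010100000101011001001001100101001101011001.
>   }
> }
> {code}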
>  
> Let's say a job splits this test bz2 file into two splits at this position (203426).
> The former split does not read the records that start at position 203426, because BZip2 reports
the position of these records as 203427. The latter split does not read them either, because
BZip2CompressionInputStream reads the next block starting from position 320955.
> Due to this behavior, records between 203427 and 320955 are lost (see the sketch below).
> Also, if we revert the changes in HADOOP-13270, we no longer see this issue, although the
HADOOP-13270 issue reappears.
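>
> Simplified trace of the above (illustrative variable names and checks, not Hadoop's exact record reader code):
> {code:java}
> public class SplitBoundaryTrace {
>   public static void main(String[] args) {
>     long splitEnd = 203426;        // end of the former split / start of the latter
>     long reportedPos = 203427;     // position BZip2 reports for the dropped records
>     long nextBlockMarker = 320955; // where the latter split's stream starts decoding
>
>     // Former split: a record reader keeps reading while the reported position is
>     // still within its split, so it stops before reaching these records.
>     boolean formerReadsThem = reportedPos <= splitEnd;        // false
>
>     // Latter split: the compressed stream seeks to the next block marker, so
>     // nothing before position 320955 is ever decompressed.
>     boolean latterReadsThem = reportedPos >= nextBlockMarker; // false
>
>     // Neither split reads them: records between 203427 and 320955 are lost.
>     System.out.println("former split reads them: " + formerReadsThem);
>     System.out.println("latter split reads them: " + latterReadsThem);
>   }
> }
> {code}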



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

