Date: Thu, 5 Oct 2017 16:02:00 +0000 (UTC)
From: "Jason Lowe (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-14919) BZip2 drops records when reading data in splits

    [ https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193088#comment-16193088 ]

Jason Lowe commented on HADOOP-14919:
-------------------------------------

Thanks for taking the patch for a test drive! Glad to hear it fixes the problem and doesn't seem to regress anything so far.

> BZip2 drops records when reading data in splits
> -----------------------------------------------
>
>                 Key: HADOOP-14919
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14919
>             Project: Hadoop Common
>          Issue Type: Bug
>   Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>            Reporter: Aki Tanaka
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: 250000.bz2, HADOOP-14919.001.patch, HADOOP-14919-test.patch
>
>
> BZip2 can drop records when reading data in splits. This problem was discussed before in HADOOP-11445 and HADOOP-13270, but we still have a problem in a corner case, which causes data blocks to be lost.
>
> I attached a unit test for this issue. You can reproduce the problem by running the unit test.
>
> First, this issue happens when the position of a newly created stream is equal to the start of the split.
> Hadoop has some test cases for this (the blockEndingInCR.txt.bz2 file for TestLineRecordReader#testBzip2SplitStartAtBlockMarker, etc.). However, the issue I am reporting does not happen when we run these tests, because it happens only when the byte block at the start of the split includes both the block marker and compressed data.
>
> BZip2 block marker - 0x314159265359 (001100010100000101011001001001100101001101011001)
>
> blockEndingInCR.txt.bz2 (Start of Split - 136504):
> {code:java}
> $ xxd -l 6 -g 1 -b -seek 136498 ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
> 0021532: 00110001 01000001 01011001 00100110 01010011 01011001  1AY&SY
> {code}
>
> Test bz2 File (Start of Split - 203426):
> {code:java}
> $ xxd -l 7 -g 1 -b -seek 203419 250000.bz2
> 0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
> 0031aa1: 00101111                                               /
> {code}
>
> Let's say a job divides this test bz2 file into two splits at position 203426.
> The former split does not read the records that start at position 203426, because BZip2 reports the position of these dropped records as 203427. The latter split does not read these records either, because BZip2CompressionInputStream reads the block from position 320955.
> Due to this behavior, the records between 203427 and 320955 are lost.
> Also, if we revert the changes in HADOOP-13270, we will not see this issue, though we will see the HADOOP-13270 issue again.
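
For anyone who wants to check the two hex dumps quoted in the report, here is a minimal sketch in Python. It is not Hadoop's implementation, and the function names (find_block_marker, first_bit_after_marker) are hypothetical. It only slides a 48-bit window over the bytes copied from the xxd output and reports where the block marker 0x314159265359 ends relative to the file offsets given in the report.

```python
BLOCK_MARKER = 0x314159265359  # 48-bit BZip2 block marker quoted in the report
MARKER_BITS = 48


def find_block_marker(data: bytes) -> int:
    """Return the bit offset of the first block marker in data, or -1 if absent."""
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")
    mask = (1 << MARKER_BITS) - 1
    # Slide a 48-bit window over the data one bit at a time, since bzip2
    # blocks are not required to start on a byte boundary.
    for offset in range(total_bits - MARKER_BITS + 1):
        shift = total_bits - MARKER_BITS - offset
        if (value >> shift) & mask == BLOCK_MARKER:
            return offset
    return -1


def first_bit_after_marker(data: bytes, file_offset: int):
    """Return (byte position, bit index within that byte) just past the marker."""
    start = find_block_marker(data)
    if start < 0:
        return None
    end_bit = file_offset * 8 + start + MARKER_BITS
    return divmod(end_bit, 8)


# Bytes copied from the two xxd dumps in the report.
aligned = bytes([0b00110001, 0b01000001, 0b01011001,
                 0b00100110, 0b01010011, 0b01011001])
unaligned = bytes([0b11100110, 0b00101000, 0b00101011, 0b00100100,
                   0b11001010, 0b01101011, 0b00101111])

print(first_bit_after_marker(aligned, 136498))    # (136504, 0)
print(first_bit_after_marker(unaligned, 203419))  # (203425, 3)
```

In blockEndingInCR.txt.bz2 the marker ends exactly on the byte boundary at 136504, which is the start of the split. In 250000.bz2 the marker starts 3 bits into byte 203419 and ends 3 bits into byte 203425, so the last byte before the split start at 203426 holds both the tail of the marker and the first bits of compressed data. That is consistent with the reporter's description of when the problem appears, at least as far as these seven bytes show.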