hadoop-common-issues mailing list archives

From "Gabor Bota (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15543) IndexOutOfBoundsException when reading bzip2-compressed SequenceFile
Date Tue, 19 Jun 2018 09:44:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516881#comment-16516881 ]

Gabor Bota commented on HADOOP-15543:
-------------------------------------

[~zvenczel], I think this would be an interesting bzip issue.

> IndexOutOfBoundsException when reading bzip2-compressed SequenceFile
> --------------------------------------------------------------------
>
>                 Key: HADOOP-15543
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15543
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.1.0
>            Reporter: Sebastian Nagel
>            Priority: Major
>
> When reading a bzip2-compressed SequenceFile, Hadoop jobs fail with: 
> {noformat}
> IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046)
> {noformat}
> The SequenceFile (669 MB) has been written with the properties
> - mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
> - mapreduce.output.fileoutputformat.compress.type=BLOCK
> using the native bzip2 library on Hadoop CDH 5.14.2 (Ubuntu 16.04, libbz2-1.0 1.0.6-8).
> The error was seen on two development systems (local mode, no native bzip2 lib configured/installed) and, so far, is reproducible with Hadoop 3.1.0 and CDH 5.14.2.
> The following Hadoop releases are not affected: 2.7.4, 3.0.2, CDH 5.14.0. The SequenceFile is read successfully when these Hadoop packages are used.
> If required I can share the SequenceFile. It's a Nutch CrawlDb (contains [CrawlDatum|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java] objects).
> Full stack trace as seen with 3.1.0:
> {noformat}
> 2018-06-15 10:34:43,198 INFO  mapreduce.Job -  map 93% reduce 0%
> 2018-06-15 10:34:43,532 WARN  mapred.LocalJobRunner - job_local543410164_0001
> java.lang.Exception: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)
> Caused by: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
>         at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:398)
>         at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:496)
>         at java.io.DataInputStream.readFully(DataInputStream.java:195)
>         at java.io.DataInputStream.readFully(DataInputStream.java:169)
>         at org.apache.hadoop.io.WritableUtils.readString(WritableUtils.java:125)
>         at org.apache.hadoop.io.WritableUtils.readStringArray(WritableUtils.java:169)
>         at org.apache.nutch.protocol.ProtocolStatus.readFields(ProtocolStatus.java:177)
>         at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:188)
>         at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:332)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
>         at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2374)
>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2358)
>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:568)
>         at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
>         at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
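For anyone trying to reproduce this locally, below is a minimal sketch of writing and then reading a BLOCK-compressed bzip2 SequenceFile with the reported codec/type settings. The output path and the Text/IntWritable record types are placeholders (the original file holds Nutch CrawlDatum values), and this small synthetic file may not be large enough to trigger the failure, which was observed on a 669 MB file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;

public class BzipSeqFileRepro {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/repro.seq"); // hypothetical local path

        // Write a BLOCK-compressed SequenceFile with the bzip2 codec,
        // mirroring mapreduce.output.fileoutputformat.compress.{codec,type}.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new BZip2Codec()))) {
            for (int i = 0; i < 100_000; i++) {
                writer.append(new Text("key-" + i), new IntWritable(i));
            }
        }

        // Read it back; on affected versions the read side throws
        // IndexOutOfBoundsException inside CBZip2InputStream.read().
        Text key = new Text();
        IntWritable value = new IntWritable();
        long count = 0;
        try (SequenceFile.Reader reader =
                new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            while (reader.next(key, value)) {
                count++;
            }
        }
        System.out.println("records read: " + count);
    }
}
```

On an unaffected release every appended record should be read back; on an affected one the reader fails partway through with the offs/len/dest.length exception above.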



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

