spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: count()-ing gz files gives java.io.IOException: incorrect header check
Date Thu, 22 May 2014 03:38:04 GMT
One thing you can try is to pull each file out of S3 and decompress with
"gzip -d" to see if it works.  I'm guessing there's a corrupted .gz file
somewhere in your path glob.
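
If checking files one at a time by hand is tedious, here is a rough sketch of the same test in Python. It assumes the files have already been copied from S3 to local disk. The helper name check_gzip and its return strings are made up for illustration. It first looks for the two gzip magic bytes, which catches the "wrong extension" case mentioned further down the thread, and then reads the whole stream to catch corruption or truncation.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def check_gzip(path, chunk_size=1 << 20):
    """Hypothetical helper: return 'ok', 'not-gzip', or 'corrupt' for a local file."""
    # Cheap check first: a file without the magic bytes is not gzip at all.
    with open(path, "rb") as f:
        if f.read(2) != GZIP_MAGIC:
            return "not-gzip"
    # Decompress the whole stream in chunks and discard the output,
    # similar in spirit to running "gzip -d" on the file.
    try:
        with gzip.open(path, "rb") as g:
            while g.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return "corrupt"
    return "ok"
```

Calling check_gzip on every local path matched by the glob and printing anything that is not "ok" should narrow things down to the offending file, if there is one. Note that this only tests the files against Python's gzip reader, not against Hadoop's decompressor.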

Andrew


On Wed, May 21, 2014 at 12:40 PM, Michael Cutler <michael@tumra.com> wrote:

> Hi Nick,
>
> Which version of Hadoop are you using with Spark?  I spotted an issue with
> the built-in GzipDecompressor while doing something similar with Hadoop
> 1.0.4. All my Gzip files were valid and tested, yet certain files blew up
> when read through Hadoop/Spark.
>
> The following JIRA ticket goes into more detail:
> https://issues.apache.org/jira/browse/HADOOP-8900
> It affects all Hadoop releases prior to 1.2.X.
>
> MC
>
> On 21 May 2014 14:26, Madhu <madhu@madhu.com> wrote:
>
>> Can you identify a specific file that fails?
>> There might be a real bug here, but I have found gzip to be reliable.
>> Every time I have run into a "bad header" error with gzip, I had a
>> non-gzip file with the wrong extension for whatever reason.
>>
>>
>>
>>
>> -----
>> Madhu
>> https://www.linkedin.com/in/msiddalingaiah
>>
>
>
