spark-user mailing list archives

From Michael Cutler
Subject Re: count()-ing gz files gives incorrect header check
Date Wed, 21 May 2014 16:40:41 GMT
Hi Nick,

Which version of Hadoop are you using with Spark?  I spotted an issue with
the built-in GzipDecompressor while doing something similar with Hadoop
1.0.4: all my gzip files were valid and tested, yet certain files blew up
when read from Hadoop/Spark.

The following JIRA ticket goes into more detail; the issue affects all
Hadoop releases prior to 1.2.X.


*Michael Cutler*
Founder, CTO


On 21 May 2014 14:26, Madhu wrote:

> Can you identify a specific file that fails?
> There might be a real bug here, but I have found gzip to be reliable.
> Every time I have run into a "bad header" error with gzip, I had a non-gzip
> file with the wrong extension for whatever reason.
> -----
> Madhu
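
To separate the two possibilities raised in this thread (a non-gzip file
with a .gz extension versus a decompressor problem on the Hadoop side), a
quick local check can help. The following is only a minimal sketch: the
function names has_gzip_magic and decompresses_cleanly are made up for
illustration, and it assumes the suspect files have been copied to a local
disk. It looks for the two gzip magic bytes (0x1f 0x8b) and then tries to
read the whole file with Python's standard gzip module.

```python
import gzip
import sys
import zlib

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def has_gzip_magic(path):
    """Return True if the file starts with the two gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def decompresses_cleanly(path, chunk_size=1 << 20):
    """Return True if Python's gzip module can read the file to the end."""
    try:
        with gzip.open(path, "rb") as f:
            # Read in chunks so large files are not held in memory.
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers "not a gzipped file"; EOFError covers truncation;
        # zlib.error covers corrupt compressed data.
        return False


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_gz.py file1.gz file2.gz
    for path in sys.argv[1:]:
        print(path,
              "magic_ok=%s" % has_gzip_magic(path),
              "reads_ok=%s" % decompresses_cleanly(path))
```

A file that fails the magic-byte check is most likely the mislabelled-file
case Madhu describes. A file that passes both checks locally but still fails
in Hadoop/Spark would be more consistent with a problem in the decompressor
than with a bad file, though this check alone does not prove it.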
