spark-user mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: count()-ing gz files gives java.io.IOException: incorrect header check
Date Sat, 31 May 2014 20:52:21 GMT
That's a neat idea. I'll try that out.


On Sat, May 31, 2014 at 2:45 PM, Patrick Wendell <pwendell@gmail.com> wrote:

> I think there are a few ways to do this... the simplest one might be to
> manually build a set of comma-separated paths that excludes the bad file,
> and pass that to textFile().
>
> When you call textFile(), under the hood it passes your filename string
> to hadoopFile(), which calls setInputPaths() on the Hadoop
> FileInputFormat.
>
>
> http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path...)
>
> I think this can accept a comma-separated list of paths.
>
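> So you could do something like this (an untested Scala sketch):
> import org.apache.hadoop.fs.Path
> val pattern = new Path("s3n://bucket/stuff/*.gz")
> val fs = pattern.getFileSystem(sc.hadoopConfiguration)
> // globStatus expands the wildcard; listStatus does not
> val files = fs.globStatus(pattern).filter(_.getPath.getName != "bad.gz")
> val fileStr = files.map(_.getPath.toString).mkString(",")
>
> sc.textFile(fileStr)...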
>
> - Patrick
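
A minimal Python sketch of the path-filtering idea above, for readers on PySpark. It assumes the candidate paths have already been listed by some other means (for example an S3 listing). The helper name `build_input_paths` and the example paths are hypothetical; only the comma-joined string handed to `sc.textFile()` comes from the thread.

```python
def build_input_paths(paths, exclude):
    """Join paths into a comma-separated string, skipping excluded ones."""
    excluded = set(exclude)
    return ",".join(p for p in paths if p not in excluded)


# Hypothetical example paths, not taken from a real bucket.
all_paths = [
    "s3n://bucket/stuff/a.gz",
    "s3n://bucket/stuff/bad.gz",
    "s3n://bucket/stuff/b.gz",
]
file_str = build_input_paths(all_paths, exclude=["s3n://bucket/stuff/bad.gz"])

# With a live SparkContext `sc`, the result would then be used as:
#     sc.textFile(file_str).count()
```

Because the paths are joined with commas, this sketch would not work for paths that themselves contain a comma.
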
>
>
>
>
> On Fri, May 30, 2014 at 4:20 PM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>> YES, your hunches were correct. I’ve identified at least one file among
>> the hundreds I’m processing that is indeed not a valid gzip file.
>>
>> Does anyone know of an easy way to exclude a specific file or files when
>> calling sc.textFile() on a pattern? e.g. Something like: sc.textFile('s3n://bucket/stuff/*.gz,
>> exclude:s3n://bucket/stuff/bad.gz')
>>
>>
>> On Wed, May 21, 2014 at 11:50 PM, Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>>> Thanks for the suggestions, people. I will try to home in on which
>>> specific gzipped files, if any, are actually corrupt.
>>>
>>> Michael,
>>>
>>> I’m using Hadoop 1.0.4, which I believe is the default version that gets
>>> deployed by spark-ec2. The JIRA issue I linked to earlier, HADOOP-5281
>>> <https://issues.apache.org/jira/browse/HADOOP-5281>, affects Hadoop
>>> 0.18.0 and is fixed in 0.20.0 and is also related to gzip compression. I
>>> know there is some funkiness in how Hadoop is versioned, so I’m not sure if
>>> this issue is relevant to 1.0.4.
>>>
>>> Were you able to resolve your issue by changing your version of Hadoop?
>>> How did you do that?
>>>
>>> Nick
>>>
>>>
>>> On Wed, May 21, 2014 at 11:38 PM, Andrew Ash <andrew@andrewash.com>
>>> wrote:
>>>
>>>> One thing you can try is to pull each file out of S3 and decompress
>>>> with "gzip -d" to see if it works.  I'm guessing there's a corrupted .gz
>>>> file somewhere in your path glob.
>>>>
>>>> Andrew
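
The check Andrew suggests can also be scripted. The following is a sketch only, assuming the files have already been copied from S3 to local disk; the function name `is_valid_gzip` is hypothetical. It reads each file to the end with Python's standard `gzip` module, which is roughly what `gzip -d` (or `gzip -t`) does.

```python
import gzip
import zlib


def is_valid_gzip(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses cleanly, else False."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end so truncation and CRC errors surface too.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad header, EOFError a truncated stream,
        # zlib.error corrupt compressed data.
        return False
```

Running this over every downloaded file and collecting the paths where it returns `False` gives a list of files to exclude.
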
>>>>
>>>>
>>>> On Wed, May 21, 2014 at 12:40 PM, Michael Cutler <michael@tumra.com>
>>>> wrote:
>>>>
>>>>> Hi Nick,
>>>>>
>>>>> Which version of Hadoop are you using with Spark?  I spotted an issue
>>>>> with the built-in GzipDecompressor while doing something similar with
>>>>> Hadoop 1.0.4: all my gzip files were valid and tested, yet certain
>>>>> files blew up in Hadoop/Spark.
>>>>>
>>>>> The following JIRA ticket goes into more detail
>>>>> https://issues.apache.org/jira/browse/HADOOP-8900 and it affects all
>>>>> Hadoop releases prior to 1.2.X
>>>>>
>>>>> MC
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  *Michael Cutler*
>>>>> Founder, CTO
>>>>>
>>>>> On 21 May 2014 14:26, Madhu <madhu@madhu.com> wrote:
>>>>>
>>>>>> Can you identify a specific file that fails?
>>>>>> There might be a real bug here, but I have found gzip to be reliable.
>>>>>> Every time I have run into a "bad header" error with gzip, I had a
>>>>>> non-gzip file with the wrong extension for whatever reason.
>>>>>>
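
A quick way to spot a non-gzip file carrying a `.gz` extension is to look at its first two bytes, which for a gzip file are `0x1f 0x8b`. This is a sketch under the assumption that the file is readable from local disk; the name `looks_like_gzip` is hypothetical.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip file (RFC 1952)


def looks_like_gzip(path):
    """Cheap header check: True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC
```

Passing this check only shows that the header looks right; it does not prove the rest of the file is intact.
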
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----
>>>>>> Madhu
>>>>>> https://www.linkedin.com/in/msiddalingaiah
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
