spark-dev mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: One corrupt gzip in a directory of 100s
Date Thu, 02 Apr 2015 15:01:25 GMT
S3n is governed by the same config parameter. 

Cheers
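
A sketch of how this parameter can be raised, assuming the usual spark.hadoop.* passthrough for Hadoop properties (the value 10 below is only an illustration; the default for fs.s3.maxRetries is 4):

```
spark-submit --conf spark.hadoop.fs.s3.maxRetries=10 ...
```

The same property can also be set directly in Hadoop's core-site.xml.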



> On Apr 2, 2015, at 7:33 AM, Romi Kuntsman <romi@totango.com> wrote:
> 
> Hi Ted,
> Not sure what the config value is - I'm using the s3n filesystem, not s3.
> 
> The error that I get is the following:
> (so does that mean it's 4 retries?)
> 
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2
> in stage 0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 11, ip.ec2.internal):
> java.net.UnknownHostException: mybucket.s3.amazonaws.com
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:579)
>         at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:618)
>         at sun.security.ssl.SSLSocketImpl.<init>(SSLSocketImpl.java:451)
>         at sun.security.ssl.SSLSocketFactoryImpl.createSocket(SSLSocketFactoryImpl.java:140)
>         at org.apache.commons.httpclient.protocol.SSLProtocolSocketFactory.createSocket(SSLProtocolSocketFactory.java:82)
>         at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
>         at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158)
>         at java.lang.Thread.run(Thread.java:745)
> 
> Romi Kuntsman, Big Data Engineer
> http://www.totango.com
> 
>> On Wed, Apr 1, 2015 at 6:46 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> bq. writing the output (to Amazon S3) failed
>> 
>> What's the value of "fs.s3.maxRetries" ?
>> Increasing the value should help.
>> 
>> Cheers
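
Independent of the Hadoop-level setting, the retry-on-transient-failure idea itself can be sketched in plain Python; nothing below is a Spark API, and the function names are illustrative:

```python
import time

def with_retries(fn, max_retries=4, base_delay=0.01, retryable=(OSError,)):
    """Call fn(), retrying transient errors (e.g. a DNS hiccup) with
    exponential backoff; re-raise after max_retries failed attempts."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Illustrative caller: fails twice (like a transient DNS error), then succeeds.
attempts = {"n": 0}

def flaky_upload():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise OSError("UnknownHostException: mybucket.s3.amazonaws.com")
    return "ok"

result = with_retries(flaky_upload)  # succeeds on the third attempt
```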
>> 
>>> On Wed, Apr 1, 2015 at 8:34 AM, Romi Kuntsman <romi@totango.com> wrote:
>>> What about communication errors and not corrupted files?
>>> Both when reading input and when writing output.
>>> We currently experience a failure of the entire process if the last stage
>>> of writing the output (to Amazon S3) fails because of a very temporary DNS
>>> resolution issue (easily resolved by retrying).
>>> 
>>> *Romi Kuntsman*, *Big Data Engineer*
>>> 
>>>  http://www.totango.com
>>> 
>>> On Wed, Apr 1, 2015 at 12:58 PM, Gil Vernik <GILV@il.ibm.com> wrote:
>>> 
>>> > I actually saw the same issue, where we analyzed a container with a few
>>> > hundred GBs of zip files - one was corrupted and Spark exited with an
>>> > exception, failing the entire job.
>>> > I like SPARK-6593, since it can also cover additional cases, not just
>>> > corrupted zip files.
>>> >
>>> >
>>> >
>>> > From:   Dale Richardson <dale__r@hotmail.com>
>>> > To:     "dev@spark.apache.org" <dev@spark.apache.org>
>>> > Date:   29/03/2015 11:48 PM
>>> > Subject:        One corrupt gzip in a directory of 100s
>>> >
>>> >
>>> >
>>> > Recently had an incident reported to me where somebody was analysing a
>>> > directory of gzipped log files, and was struggling to load them into spark
>>> > because one of the files was corrupted - calling
>>> > sc.textFile('hdfs:///logs/*.gz') caused an IOException on the particular
>>> > executor that was reading that file, which caused the entire job to be
>>> > cancelled after the retry count was exceeded, without any way of catching
>>> > and recovering from the error.  While normally I think it is entirely
>>> > appropriate to stop execution if something is wrong with your input,
>>> > sometimes it is useful to analyse what you can get (as long as you are
>>> > aware that input has been skipped), and treat corrupt files as acceptable
>>> > losses.
>>> > To cater for this particular case I've added SPARK-6593 (PR at
>>> > https://github.com/apache/spark/pull/5250), which adds an option
>>> > (spark.hadoop.ignoreInputErrors) to log exceptions raised by the Hadoop
>>> > input format, but to continue on with the next task.
>>> > Ideally in this case you would want to report the corrupt file paths back
>>> > to the master so they could be dealt with in a particular way (e.g. moved to
>>> > a separate directory), but that would require a public API
>>> > change/addition. I was pondering an addition to Spark's Hadoop API that
>>> > could report processing status back to the master via an optional
>>> > accumulator that collects filepath/Option(exception message) tuples so the
>>> > user has some idea of what files are being processed, and what files are
>>> > being skipped.
>>> > Regards, Dale.
>>> >
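
The skip-and-report behaviour described above (logging the corrupt file rather than failing the job, plus the proposed filepath/error tuples) can be sketched outside Spark with plain Python's gzip module; this is only an illustration, not the SPARK-6593 implementation:

```python
import gzip
import os
import tempfile

def read_gzips_skipping_corrupt(paths):
    """Read each gzip file; on a decompression error, record the path and
    error message instead of aborting the whole run (cf. SPARK-6593)."""
    lines, report = [], []
    for path in paths:
        try:
            with gzip.open(path, "rt") as f:
                lines.extend(f.read().splitlines())
            report.append((path, None))          # processed successfully
        except OSError as e:
            report.append((path, str(e)))        # skipped, but accounted for
    return lines, report

# Illustrative run: two good gzip files and one corrupt one.
d = tempfile.mkdtemp()
for name, text in [("a.gz", "alpha"), ("b.gz", "beta")]:
    with gzip.open(os.path.join(d, name), "wt") as f:
        f.write(text)
with open(os.path.join(d, "bad.gz"), "wb") as f:
    f.write(b"not a gzip stream")                # deliberately corrupt

paths = sorted(os.path.join(d, n) for n in os.listdir(d))
lines, report = read_gzips_skipping_corrupt(paths)
```

In Spark itself the report list would naturally be an accumulator, so the driver can see which files were skipped.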
> 
