spark-user mailing list archives

From Sandy Ryza <sandy.r...@cloudera.com>
Subject Re: Many retries for Python job
Date Fri, 21 Nov 2014 18:41:28 GMT
Hi Brett,

Are you noticing executors dying?  Are you able to check the YARN
NodeManager logs and see whether YARN is killing them for exceeding memory
limits?
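
If it does turn out that YARN is killing the containers, a common mitigation is
to leave more off-heap headroom per executor.  A rough sketch of what that
might look like in a PySpark driver is below; the app name and the 1024 MB
figure are only placeholders, and the same setting can also be passed to
spark-submit with --conf:

    # Sketch only: reserve extra off-heap headroom so YARN does not kill
    # executors for exceeding their container's memory limit.  The value is
    # in MB and would need tuning for the actual workload.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("s3-aggregation-job")  # placeholder name
            .set("spark.yarn.executor.memoryOverhead", "1024"))
    sc = SparkContext(conf=conf)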

-Sandy

On Fri, Nov 21, 2014 at 9:47 AM, Brett Meyer <Brett.Meyer@crowdstrike.com>
wrote:

> I’m running a Python script with spark-submit on top of YARN on an EMR
> cluster with 30 nodes.  The script reads in approximately 3.9 TB of data
> from S3, and then does some transformations and filtering, followed by some
> aggregate counts.  During Stage 2 of the job, everything looks to complete
> just fine with no executor failures or resubmissions, but when Stage 3
> starts up, many Stage 2 tasks have to be rerun due to FetchFailure errors.
> Actually, I usually see at least 3-4 retries on Stage 2 before Stage 3 can
> successfully start.  The whole application eventually completes, but all of
> the retries add roughly an hour or more of overhead.
>
> I’m trying to determine why there were FetchFailure exceptions, since
> anything computed in the job that cannot fit in the available memory cache
> should, by default, be spilled to disk for later retrieval.  I also see
> some "java.net.ConnectException: Connection refused" and
> "java.io.IOException: sendMessageReliably failed without being ACK’d"
> errors in the logs after a CancelledKeyException followed by
> a ClosedChannelException, but I have no idea why the nodes in the EMR
> cluster would suddenly stop being able to communicate.
>
> If anyone has ideas as to why the data needs to be rerun several times in
> this job, please let me know as I am fairly bewildered about this behavior.
>
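
On the spill point above: shuffle output is written to local disk, but cached
data is only kept (and spilled) if an RDD is explicitly persisted; the default
MEMORY_ONLY level drops partitions that do not fit rather than spilling them.
Purely as an illustration (the path and names below are placeholders, not
taken from your job), an explicit disk-backed persist would look something
like:

    # Illustrative only: cache an intermediate RDD with disk as a fallback so
    # partitions that do not fit in memory are spilled instead of dropped.
    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="persist-example")            # placeholder name
    records = sc.textFile("s3n://some-bucket/input/")       # placeholder path
    filtered = records.filter(lambda line: len(line) > 0)
    filtered.persist(StorageLevel.MEMORY_AND_DISK)
    print(filtered.count())

That said, repeated FetchFailures at a stage boundary usually point to the
executors that wrote the shuffle files going away, which is why the
NodeManager logs are the first place to look.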
