spark-dev mailing list archives

From Wes Holler <whol...@algebraixdata.com>
Subject RPC Timeout and Abnormally Long JvmGcTime
Date Fri, 29 Apr 2016 23:09:17 GMT
Recently we switched to EMR 4.5/Spark 1.6.1 and have since encountered a new failure scenario.

The primary symptom is that the cluster appears to be stalled. The job has not failed but
will not proceed and has to be killed. One or more RpcTimeoutExceptions (see below) are usually
found towards the bottom of the log. Another observation is that, using the status API, I
can see that the jvmGcTime value (see JSON snippet below) for some tasks in a stage that never
finished is abnormally large (in the millions). Even for identical queries/data, this problem
doesn't always happen, and when the cluster doesn't hang, I also do not see these large values
for jvmGcTime.
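
For reference, this is roughly how I pull the task-level metrics out of the status API. A minimal sketch in Scala, assuming the driver UI is reachable on port 4040; the application and stage IDs here are placeholders for our actual values:

    import scala.io.Source

    // Placeholders - substitute the real application and stage IDs.
    val appId   = "application_1461972557087_0001"
    val stageId = 42
    // The v1 status API (org.apache.spark.status.api.v1) serves stage data,
    // including the per-task map with taskMetrics, at this endpoint.
    val url  = s"http://driver-host:4040/api/v1/applications/$appId/stages/$stageId"
    val json = Source.fromURL(url).mkString
    println(json.take(500))  // inspect the head of the response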

Some questions:
1. What units is jvmGcTime in? (A small probe of the JVM's own GC accounting is sketched after this list.)
2. If it is in milliseconds, then that seems to indicate garbage collection times into the
tens of minutes - far longer than the queries themselves usually take and also longer than
the default spark.rpc.askTimeout value of 120s. Could this be blocking/starving the promise
and causing the failure?
3. We are doing almost entirely standard SQL, no UDFs, and no use of the RDD API. I was under
the impression that in Spark 1.6, pretty much everything in DataFrames uses the custom Tungsten
serialization and unsafe buffers, even in the actual processing (so no actual unroll). Is
this correct? What could be causing such GC churn?
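
For context on question 1: if jvmGcTime is derived from the JVM's own GC accounting (the GarbageCollectorMXBean interface), it would be in milliseconds, since getCollectionTime is documented to return the approximate accumulated collection time in ms. A standalone probe, just to illustrate that API (I am not claiming this is exactly how Spark computes the metric):

    import java.lang.management.ManagementFactory
    import scala.collection.JavaConverters._

    // Sum accumulated GC time across all collectors in this JVM.
    // getCollectionTime returns milliseconds, or -1 if unsupported.
    val totalGcMillis = ManagementFactory.getGarbageCollectorMXBeans.asScala
      .map(_.getCollectionTime)
      .filter(_ >= 0)
      .sum
    println(s"Accumulated GC time: $totalGcMillis ms")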

===============================================================================

The exception found in the log (several stack trace lines removed for brevity):

16/04/29 17:43:29 ERROR ContextCleaner: Error cleaning broadcast 0
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout
is controlled by spark.rpc.askTimeout
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
	.
	at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
	.
	... 12 more
16/04/29 17:43:29 WARN BlockManagerMaster: Failed to remove broadcast 0 with removeFromMaster
= true - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout
is controlled by spark.rpc.askTimeout
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
	.
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
	at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:242)
	... 7 more
16/04/29 17:43:29 INFO ContextCleaner: Cleaned accumulator 2
16/04/29 19:33:23 INFO PackagesResourceConfig: Scanning for root resource and provider classes
in the packages:
  org.apache.spark.status.api.v1
16/04/29 19:33:28 INFO ScanningResourceConfig: Root resource classes found:
  class org.apache.spark.status.api.v1.ApiRootResource
16/04/29 19:33:28 INFO ScanningResourceConfig: Provider classes found:
  class org.apache.spark.status.api.v1.JacksonMessageWriter
16/04/29 19:33:28 INFO WebApplicationImpl: Initiating Jersey application, version 'Jersey:
1.9 09/02/2011 11:17 AM'
16/04/29 19:34:09 WARN Errors: The following warnings have been detected with resource and/or
provider classes:
  WARNING: A sub-resource method, public scala.collection.Seq org.apache.spark.status.api.v1.OneStageResource.stageData(int),
with URI template, "", is treated as a resource method
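
If long GC pauses really are starving the reply promise, one stopgap would be to raise the timeout. A sketch, assuming the settings are applied before the SparkContext is created (spark.network.timeout is the broader default that spark.rpc.askTimeout falls back to):

    import org.apache.spark.SparkConf

    // Stopgap sketch: give RPC replies more headroom than the 120s default.
    val conf = new SparkConf()
      .set("spark.rpc.askTimeout", "600s")
      .set("spark.network.timeout", "600s")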

===============================================================================

A sample from the status API output:

"1446406" : {
      "taskId" : 1446406,
      "index" : 593,
      "attempt" : 0,
      "launchTime" : "2016-04-27T11:16:25.005GMT",
      "executorId" : "364",
      "host" : "ip-10-0-0-41.ec2.internal",
      "taskLocality" : "RACK_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "taskMetrics" : {
        "executorDeserializeTime" : 0,
        "executorRunTime" : 0,
        "resultSize" : 0,
        "jvmGcTime" : 1637765,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "inputMetrics" : {
          "bytesRead" : 40824341,
          "recordsRead" : 1130000
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      }
    },
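
One thing we can try to dig into question 3 is verbose GC logging on the executors, so the abnormal jvmGcTime values can be correlated with actual pause events. A sketch using standard HotSpot flags (pre-unified-logging JVMs, which is what EMR 4.5 ships):

    import org.apache.spark.SparkConf

    // Sketch: emit detailed GC events to the executor stdout/stderr logs.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")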



