spark-user mailing list archives

From Perttu Ranta-aho <>
Subject PySpark & Mesos random crashes
Date Sun, 25 May 2014 19:10:30 GMT

We have a small Mesos (0.18.1) cluster with 4 nodes. We upgraded to Spark
1.0.0-rc9 to overcome some PySpark bugs, but now we are experiencing
random crashes with almost every job. Local jobs run fine, but the same code
with the same data set on the Mesos cluster leads to errors like:

14/05/22 15:03:34 ERROR DAGSchedulerActorSupervisor: eventProcesserActor
failed due to the error EOF reached before Python server acknowledged;
shutting down SparkContext
14/05/22 15:03:34 INFO DAGScheduler: Failed to run saveAsTextFile at
Traceback (most recent call last):
  File "", line 58, in <module>
  File "/srv/spark/spark-1.0.0-bin-2.0.5-alpha/python/pyspark/", line
910, in saveAsTextFile
line 537, in __call__
line 300, in get_return_value
py4j.protocol.Py4JJavaError14/05/22 15:03:34 INFO TaskSchedulerImpl:
Cancelling stage 0
: An error occurred while calling o44.saveAsTextFile.
: org.apache.spark.SparkException: Job 0 cancelled as part of cancellation
of all jobs
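The job itself is nothing special. A minimal sketch of the shape of the code (the app name, master URL, and paths below are examples, not our exact values):

```python
from pyspark import SparkConf, SparkContext

# Example values only -- our real master URL and HDFS paths differ.
conf = SparkConf().setAppName("example-job").setMaster("mesos://master-host:5050")
sc = SparkContext(conf=conf)

lines = sc.textFile("hdfs://namenode/data/input")
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

# This is the step that fails on Mesos (but works locally):
counts.saveAsTextFile("hdfs://namenode/data/output")
sc.stop()
```

Run locally (local[*] master) the same script completes without errors.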

This looks similar to , with the exception that the code isn't "bad".
Furthermore, we are seeing lots of Mesos(?) warnings like this:

W0522 14:51:19.045565 10497 sched.cpp:901] Attempting to launch task 869
with an unknown offer 20140516-155535-170164746-5050-22001-112345

We didn't see these with previous Mesos & Spark versions. There aren't any
related errors in the Mesos slave logs; instead, they report the tasks
finishing without problems. Scala code seems to run fine, so I suppose this
isn't an issue with our Mesos installation.
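In case our setup matters: we point Spark at Mesos roughly as the docs describe, via conf/spark-env.sh (the paths and hostname below are examples, not our exact values):

```shell
# conf/spark-env.sh -- example values only
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://namenode/spark/spark-1.0.0-bin-2.0.5-alpha.tgz
export MASTER=mesos://master-host:5050
```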

Any ideas what might be wrong? Or is this a bug in Spark?
