spark-user mailing list archives

From Nithin Asokan <anithi...@gmail.com>
Subject Re: Executor Lost Failure
Date Tue, 29 Sep 2015 16:11:16 GMT
Try increasing executor memory (--conf spark.executor.memory=3g or
--executor-memory 3g). Here is something I noticed in your logs:

15/09/29 06:32:03 WARN MemoryStore: Failed to reserve initial memory
threshold of 1024.0 KB for computing block rdd_2_1813 in memory.
15/09/29 06:32:03 WARN MemoryStore: Not enough space to cache
rdd_2_1813 in memory!
(computed 840.0 B so far)
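
One caveat on the flags above: the log reports "Lost executor driver on localhost", which suggests the job ran in local mode, where tasks execute inside the driver JVM. In that case the driver heap (--driver-memory) is probably the setting that matters, not spark.executor.memory. A minimal sketch of picking the flag from the master URL, assuming a hypothetical helper named memory_flags and reusing the 3g value purely as an illustration:

```python
def memory_flags(master, memory="3g"):
    """Return spark-submit memory arguments for a given master URL.

    Hypothetical helper for illustration only. In local mode the
    executor runs inside the driver JVM, so the driver heap is raised.
    """
    is_local = master == "local" or master.startswith("local[")
    if is_local:
        return ["--driver-memory", memory]
    return ["--executor-memory", memory]


# Build a command line for the script named in the traceback.
# The master URL here is an assumption, not taken from the original job.
master = "local[*]"
cmd = ["spark-submit", "--master", master] + memory_flags(master) + ["sdword2vec.py"]
print(" ".join(cmd))
```

The memory value itself still has to be chosen to fit the machine; the sketch only shows which flag applies.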

On Tue, Sep 29, 2015 at 11:02 AM Anup Sawant <anupsatishsawant@gmail.com>
wrote:

> Hi all,
> Any idea why I am getting 'Executor heartbeat timed out'? I am fairly new
> to Spark, so I know little about its internals. The job had been
> running for a day or so on 102 GB of data with 40 workers.
> -Best,
> Anup.
>
> 15/09/29 06:32:03 ERROR TaskSchedulerImpl: Lost executor driver on
> localhost: Executor heartbeat timed out after 395987 ms
> 15/09/29 06:32:03 WARN MemoryStore: Failed to reserve initial memory
> threshold of 1024.0 KB for computing block rdd_2_1813 in memory.
> 15/09/29 06:32:03 WARN MemoryStore: Not enough space to cache rdd_2_1813
> in memory! (computed 840.0 B so far)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1782.0 in stage 2713.0
> (TID 9101184, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 ERROR TaskSetManager: Task 1782 in stage 2713.0 failed 1
> times; aborting job
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1791.0 in stage 2713.0
> (TID 9101193, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1800.0 in stage 2713.0
> (TID 9101202, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1764.0 in stage 2713.0
> (TID 9101166, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1773.0 in stage 2713.0
> (TID 9101175, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1809.0 in stage 2713.0
> (TID 9101211, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1794.0 in stage 2713.0
> (TID 9101196, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1740.0 in stage 2713.0
> (TID 9101142, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1803.0 in stage 2713.0
> (TID 9101205, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1812.0 in stage 2713.0
> (TID 9101214, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1785.0 in stage 2713.0
> (TID 9101187, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1767.0 in stage 2713.0
> (TID 9101169, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1776.0 in stage 2713.0
> (TID 9101178, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1797.0 in stage 2713.0
> (TID 9101199, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1779.0 in stage 2713.0
> (TID 9101181, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1806.0 in stage 2713.0
> (TID 9101208, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1788.0 in stage 2713.0
> (TID 9101190, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1761.0 in stage 2713.0
> (TID 9101163, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1755.0 in stage 2713.0
> (TID 9101157, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1796.0 in stage 2713.0
> (TID 9101198, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1778.0 in stage 2713.0
> (TID 9101180, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1787.0 in stage 2713.0
> (TID 9101189, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1805.0 in stage 2713.0
> (TID 9101207, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1790.0 in stage 2713.0
> (TID 9101192, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1781.0 in stage 2713.0
> (TID 9101183, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1808.0 in stage 2713.0
> (TID 9101210, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1799.0 in stage 2713.0
> (TID 9101201, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1772.0 in stage 2713.0
> (TID 9101174, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1763.0 in stage 2713.0
> (TID 9101165, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1802.0 in stage 2713.0
> (TID 9101204, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1748.0 in stage 2713.0
> (TID 9101150, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1775.0 in stage 2713.0
> (TID 9101177, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1766.0 in stage 2713.0
> (TID 9101168, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1811.0 in stage 2713.0
> (TID 9101213, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1793.0 in stage 2713.0
> (TID 9101195, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1769.0 in stage 2713.0
> (TID 9101171, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1810.0 in stage 2713.0
> (TID 9101212, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1801.0 in stage 2713.0
> (TID 9101203, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1795.0 in stage 2713.0
> (TID 9101197, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1777.0 in stage 2713.0
> (TID 9101179, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1786.0 in stage 2713.0
> (TID 9101188, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1804.0 in stage 2713.0
> (TID 9101206, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1813.0 in stage 2713.0
> (TID 9101215, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1807.0 in stage 2713.0
> (TID 9101209, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1789.0 in stage 2713.0
> (TID 9101191, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1780.0 in stage 2713.0
> (TID 9101182, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1798.0 in stage 2713.0
> (TID 9101200, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1792.0 in stage 2713.0
> (TID 9101194, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1765.0 in stage 2713.0
> (TID 9101167, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1774.0 in stage 2713.0
> (TID 9101176, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1783.0 in stage 2713.0
> (TID 9101185, localhost): ExecutorLostFailure (executor driver lost)
> 15/09/29 06:32:03 WARN TaskSetManager: Lost task 1756.0 in stage 2713.0
> (TID 9101158, localhost): ExecutorLostFailure (executor driver lost)
> [Stage 2713:=========================>                       (1762 + 51) / 3354]
> 15/09/29 06:32:03 WARN SparkContext: Killing executors is only
> supported in coarse-grained mode
> 15/09/29 06:32:04 ERROR BlockManager: Failed to report rdd_2_3032 to
> master; giving up.
> Traceback (most recent call last):
>   File "/data/home/as198/sdword2vec.py", line 139, in <module>
>     main()
>   File "/data/home/as198/sdword2vec.py", line 136, in main
>     tryGensim()
>   File "/data/home/as198/sdword2vec.py", line 114, in tryGensim
>     model_dm.build_vocab(articles)
>   File "/usr/lib/python2.7/site-packages/gensim-0.12.2-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 495, in build_vocab
>     self.scan_vocab(sentences, trim_rule=trim_rule)  # initial survey
>   File "/usr/lib/python2.7/site-packages/gensim-0.12.2-py2.7-linux-x86_64.egg/gensim/models/doc2vec.py", line 620, in scan_vocab
>     for document_no, document in enumerate(documents):
>   File "/data/home/as198/sdword2vec.py", line 97, in __iter__
>     for article in labeled_rdd.collect():
>   File "/usr/local/spark-1.5.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 773, in collect
>   File "/usr/local/spark-1.5.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/usr/local/spark-1.5.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1782 in stage 2713.0 failed 1 times, most recent failure: Lost task 1782.0 in stage 2713.0 (TID 9101184, localhost): ExecutorLostFailure (executor driver lost)
> Driver stacktrace:
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>         at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
>         at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>         at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>         at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>         at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
>         at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>         at sun.reflect.GeneratedMethodAccessor62.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>         at py4j.Gateway.invoke(Gateway.java:259)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>         at java.lang.Thread.run(Thread.java:745)
>
>
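
A further observation on the quoted traceback: line 97 of sdword2vec.py calls labeled_rdd.collect() inside __iter__, so every pass gensim makes over the corpus pulls the whole RDD into the Python driver at once. It may help to stream it instead with RDD.toLocalIterator(), which fetches one partition at a time. A rough sketch, where ArticleStream is a hypothetical wrapper name and FakeRDD is only a stand-in so the example runs without Spark:

```python
class ArticleStream(object):
    """Re-iterable wrapper that streams an RDD instead of collecting it.

    Assumes `rdd` exposes toLocalIterator(), as PySpark RDDs do.
    """

    def __init__(self, rdd):
        self.rdd = rdd

    def __iter__(self):
        # Return a fresh iterator on every pass, because the corpus
        # may be scanned more than once (vocabulary scan, then training).
        return iter(self.rdd.toLocalIterator())


class FakeRDD(object):
    """Stand-in for a real RDD so the sketch runs without Spark."""

    def __init__(self, items):
        self.items = items

    def toLocalIterator(self):
        return iter(self.items)


articles = ArticleStream(FakeRDD(["doc one", "doc two"]))
first_pass = list(articles)
second_pass = list(articles)
print(first_pass == second_pass)
```

With a real RDD each pass still launches Spark jobs, so this is unlikely to fix the heartbeat timeout by itself; it only lowers the peak memory held on the Python side.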
