spark-user mailing list archives

From eshishki <itparan...@gmail.com>
Subject pyspark memory usage
Date Tue, 15 Oct 2013 14:29:56 GMT
Hello,

I set up spark-0.8.0-incubating-bin-cdh4 on a 5-node cluster.

I limited SPARK_WORKER_MEMORY to 2g, and there are 4 cores per node, so I
expected total memory consumption by Spark to be 512 MB + 2 GB per node.
The Spark web UI shows *Memory:* 10.0 GB Total, 0.0 B Used
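For context, the limit was set in conf/spark-env.sh roughly like this (the exact file contents are my reconstruction; only the values match what I described above):

```shell
# conf/spark-env.sh on each worker node (reconstructed, not verbatim)
export SPARK_WORKER_MEMORY=2g   # memory the worker may hand out to executors
export SPARK_WORKER_CORES=4     # cores per node, as reported above
```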

Then I tried to run the simple wordcount.py from the examples against an
11 GB file on HDFS.
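For reference, the heart of that example is a flatMap/map/reduceByKey pipeline over sc.textFile; a minimal plain-Python sketch of the same aggregation (illustrative only — the sample lines here are mine, and the real script runs this over RDD partitions rather than an in-memory list):

```python
from collections import Counter

# What wordcount.py computes, sketched without Spark: split each line
# into words and sum the occurrences per word. In the real script this
# is lines.flatMap(split).map(word -> (word, 1)).reduceByKey(add).
lines = ["spark worker memory", "worker memory limit"]
counts = Counter(word for line in lines for word in line.split())
print(counts["memory"])  # prints 2
```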
Spark launched 4 workers per node and did not keep their total size under
2 GB — top showed RES consumption of about 750 MB per python process, and then:
Out of memory: Kill process 26336 (python) score 97 or sacrifice child
Killed process 26336, UID 500, (python) total-vm:969696kB,
anon-rss:782976kB, file-rss:196kB

and in the logs

INFO cluster.ClusterTaskSetManager: Loss was due to org.apache.spark.SparkException
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:167)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:173)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:116)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:193)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
        at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:149)
        at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

So I could not finish the job. Yes, Spark resubmitted the task, but it
kept getting OOM-killed.

Against a smaller file, Spark did fine.

So the question is: why does Spark not limit its memory accordingly, and
how can I analyze files larger than RAM with it?

Thanks.
