spark-user mailing list archives

From "Bo Lu" <...@etinternational.com>
Subject Worker lost during processing large input
Date Thu, 31 Oct 2013 17:21:00 GMT
Hi spark users, 

I just started learning to run a standalone Spark application on a standalone cluster, and
I am very impressed by how easy it is to program with Spark.
But when I run it on a large input (about 30 GB) on my cluster, I get errors like "Removing
BlockManager" and "worker lost".

The application is one iteration of a K-means algorithm with initial K = 16.

The cluster I have is 15 nodes with 4 GB RAM and 4 cores each (one of the nodes acts as both
master and slave).

I am running Spark 0.8.0, built against Hadoop 1.1.1 for accessing HDFS.

In spark-env.sh (the same on all nodes, in the same directory):

export SPARK_WORKER_MEMORY=8g
export HADOOP_CONF_DIR="/share/hadoop-1.1.1/conf"
export SPARK_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops"

In the driver program:

System.setProperty("spark.default.parallelism", "160");
System.setProperty("spark.storage.memoryFraction", "0.1");
System.setProperty("spark.executor.memory", "8g");
System.setProperty("spark.worker.timeout", "6000");
System.setProperty("spark.akka.frameSize", "10000");
System.setProperty("spark.akka.timeout", "6000");
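For context, the full driver-side setup is just the property calls above run before the SparkContext is constructed; in Spark 0.8, properties set after the context is created are ignored. A minimal sketch (the context construction is only indicated in a comment, and any master URL or app name would be placeholders):

```java
// Sketch of the driver-side configuration from the message above.
// The key point: all System.setProperty calls must happen before
// the SparkContext/JavaSparkContext is constructed, or they have no effect.
public class DriverConfig {
    public static void configure() {
        System.setProperty("spark.default.parallelism", "160");
        System.setProperty("spark.storage.memoryFraction", "0.1");
        System.setProperty("spark.executor.memory", "8g");
        System.setProperty("spark.worker.timeout", "6000");
        System.setProperty("spark.akka.frameSize", "10000");
        System.setProperty("spark.akka.timeout", "6000");
        // new JavaSparkContext("spark://<master>:7077", "KMeansApp", ...) would follow here
    }

    public static void main(String[] args) {
        configure();
        System.out.println(System.getProperty("spark.executor.memory"));
    }
}
```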

In the program, I use groupByKey() to group all the input by cluster
id (the key). It turns out that one key has 7.8 GB of data, which is why I set System.setProperty("spark.executor.memory",
"8g"); if I lower spark.executor.memory, I get an OOM. But I need to write all the data
back to HDFS after clustering.
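To make the skew concrete, here is a plain-Java sketch (no Spark; the point data and the 90/10 split are hypothetical) of what groupByKey does per key: every value for a key is collected into one in-memory collection, which is why a 7.8 GB key has to fit on a single executor:

```java
import java.util.*;

public class SkewDemo {
    // Group (clusterId, point) pairs by key, as Spark's groupByKey does:
    // all values for a given key end up in one in-memory list.
    static Map<Integer, List<double[]>> groupByKey(List<Map.Entry<Integer, double[]>> pairs) {
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (Map.Entry<Integer, double[]> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical skewed assignment: 90% of points land in cluster 0.
        List<Map.Entry<Integer, double[]>> pairs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            int clusterId = (i % 10 == 0) ? 1 : 0;
            pairs.add(new AbstractMap.SimpleEntry<>(clusterId, new double[]{i, i}));
        }
        Map<Integer, List<double[]>> groups = groupByKey(pairs);
        // One key holds the bulk of the data; on a real cluster this whole
        // list must fit in a single executor's heap.
        System.out.println("cluster 0 size: " + groups.get(0).size());  // prints 900
        System.out.println("cluster 1 size: " + groups.get(1).size());  // prints 100
    }
}
```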

I looked at the Environment tab of the application UI and confirmed that the system properties
are all set, but one weird thing is that I get
"13/10/31 11:48:57 WARN master.Master: Removing worker-20131031105954-pen13.xmen.eti-34747
because we got no heartbeat in 60 seconds".
Shouldn't this value be 6000, since I have set System.setProperty("spark.worker.timeout",
"6000"); and System.setProperty("spark.akka.timeout", "6000");?

I also looked at the worker nodes and found a lot of swapping going on, and
also heavy GC; maybe that is why the workers get lost?

If anyone can give me a hint on how to configure the system for such an application and
cluster to solve the problem, that would be great.

Thanks.

Bo