Hi all,

I am running WikipediaPageRank (under example/Bagel) on a 30 GB Wikipedia dataset with a 7-server cluster.

My Spark build is at commit bae07e3 ("fix different versions of commons-lang dependency and apache/spark#746 addendum").

The problem is that the job fails after several stages with a java.lang.OutOfMemoryError. I suspect this is because the default executor memory size is 512 MB.

I tried to increase the executor memory via export SPARK_JAVA_OPTS="-Dspark-cores-max=8 -Dspark.executor.memory=8g", but SPARK_JAVA_OPTS is no longer recommended in Spark 1.0+, and the log also prints an ERROR from SparkConf about it.
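
If I read the 1.0 docs correctly, such settings are now supposed to go through spark-submit or conf/spark-defaults.conf instead of SPARK_JAVA_OPTS. Below is roughly what I think I should be doing instead; the 6g value is only a guess on my part, not something I have verified for this cluster:

# conf/spark-defaults.conf
spark.executor.memory   6g
spark.cores.max         8

# or the equivalent flag when submitting the job
./bin/spark-submit --executor-memory 6g ...

Is that the right approach?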

My spark-env.sh is:


export SPARK_WORKER_MEMORY=2g
export SPARK_MASTER_IP=192.168.1.12
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=2

Each server has 8 GB of memory and an 8-core CPU. But after several stages, the job fails and outputs the following logs:


14/05/19 22:29:32 WARN TaskSetManager: Loss was due to java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
14/05/19 22:29:32 INFO SparkDeploySchedulerBackend: Executor 10 disconnected, so removing it
14/05/19 22:29:32 ERROR TaskSchedulerImpl: Lost executor 10 on host125: remote Akka client disassociat
...
14/05/19 22:29:33 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
14/05/19 22:29:33 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(10, host125,
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:735)            
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:793)                        
    at java.io.DataInputStream.read(DataInputStream.java:100)                                     
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211)
    ...
14/05/19 22:29:33 INFO DAGScheduler: Failed to run foreach at Bagel.scala:251
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
14/05/19 22:29:33 INFO TaskSchedulerImpl: Cancelling stage 4
14/05/19 22:29:33 INFO TaskSchedulerImpl: Stage 4 was cancelled
14/05/19 22:29:33 WARN TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Failed on local exception: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.chan
nels.SocketChannel[connected local=/192.168.1.123:54254 remote=/192.168.1.12:9000]. 59922 millis timeout left.; Host Details : local hos
t is: "host123/192.168.1.123"; destination host is: "sing12":9000;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
    ...
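
One more thing I am not sure about: with SPARK_WORKER_INSTANCES=2 and SPARK_WORKER_MEMORY=2g, I assume each node can only offer 2 x 2 GB = 4 GB to executors in total, so even if spark.executor.memory were picked up, an 8 GB executor could never be launched. Is that understanding correct, and what settings would you recommend for 8 GB / 8-core machines?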

Regards,
Wang Hao (王灏)

CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address: 800 Dongchuan Road, Minhang District, Shanghai, 200240