Thank you for answering. I have found that there are about 6 or 7 'daemon.py' processes on one worker node. Does each core get its own 'daemon.py' process? How is the number of 'daemon.py' processes per worker node decided? I have also found many Spark-related Java processes on a worker node; if the Java process on the worker node is only responsible for communication, why does Spark need so many Java processes?
Overall, I think the main problem with my program is memory allocation. More specifically, spark-env.sh has two options, SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS, and I can also set spark.executor.memory through SPARK_JAVA_OPTS. If I have 68g of memory on a worker node, how should I distribute it among these options? At present I use the default values for SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS and set spark.executor.memory to 20g. It seems that Spark counts cached RDDs against spark.executor.memory, and I have found that each 'daemon.py' also consumes about 7g of memory. After my program runs for a while, it uses up all the memory on a worker node and the master node reports connection errors. (I have 5 worker nodes, each with 8 cores.) So I am a little confused about what each of these three options is responsible for and how to distribute memory among them.
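For reference, here is roughly what my spark-env.sh looks like; the numbers are just my current guesses, not something I know to be correct for this workload:

```shell
# spark-env.sh (standalone mode) -- values are illustrative guesses

# Heap for the standalone master/worker daemon JVMs themselves;
# currently left at the default, since these seem lightweight.
SPARK_DAEMON_MEMORY=512m

# Extra JVM flags for the daemons; also left at the default (unset).
# SPARK_DAEMON_JAVA_OPTS=""

# Executor heap, set via SPARK_JAVA_OPTS. With 68g per node, 20g for
# the executor plus ~7g per daemon.py process may be what exhausts
# the node's memory.
SPARK_JAVA_OPTS="-Dspark.executor.memory=20g"
```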
Any suggestions would be appreciated.