Hello Jey,
Thank you for answering. I have found that there are about 6 or 7 'daemon.py' processes on each worker node. Does each core get its own 'daemon.py' process? How is the number of 'daemon.py' processes on a worker node decided? I have also found that there are many Spark-related Java processes on a worker node. If the Java process on a worker node is only responsible for communication, why does Spark need so many of them?
Overall, I think the main problem in my program is memory allocation. More specifically, spark-env.sh has two options, SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS, and I can also set spark.executor.memory in SPARK_JAVA_OPTS. If a worker node has 68g of memory, how should I divide it among these options? At present I use the default values for SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS and set spark.executor.memory to 20g. It seems that Spark stores RDDs in the memory set by spark.executor.memory, and I find that each 'daemon.py' process also consumes about 7g of memory. After my program has run for a while, it uses up all the memory on a worker node and the master node reports connection errors. (I have 5 worker nodes, each with 8 cores.) So I am a little confused about what each of the three options is responsible for and how to allocate memory among them.
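To make the numbers concrete, here is a rough back-of-the-envelope sketch of the memory on one worker node, using the figures above. It assumes 7 'daemon.py' processes at about 7g each and ignores the OS and the other Java processes, so it is only an estimate, not a measurement:

```python
# Rough per-worker memory estimate from the figures in this message.
# Assumptions: 7 daemon.py processes (6 or 7 were observed), each using
# about 7 GB; the OS and the other Java processes are not counted.

NODE_MEMORY_GB = 68        # physical memory on one worker node
EXECUTOR_MEMORY_GB = 20    # value given to spark.executor.memory
DAEMON_PROCESSES = 7       # observed 'daemon.py' processes (6 or 7)
DAEMON_MEMORY_GB = 7       # approximate memory per 'daemon.py' process


def estimate_usage_gb(executor_gb, daemon_count, daemon_gb):
    """Return the estimated memory used by the executor plus the Python workers."""
    return executor_gb + daemon_count * daemon_gb


used = estimate_usage_gb(EXECUTOR_MEMORY_GB, DAEMON_PROCESSES, DAEMON_MEMORY_GB)
headroom = NODE_MEMORY_GB - used
print("estimated usage: %d GB, headroom: %d GB" % (used, headroom))
```

With 7 processes the estimate already exceeds the 68g on the node, which would be consistent with the node running out of memory; with 6 processes there are only a few gigabytes left.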
Any suggestions would be appreciated.


2013/10/8 Jey Kottalam <jey@cs.berkeley.edu>
Hi Shangyu,

The daemon.py python process is the actual PySpark worker process, and
is launched by the Spark worker when running Python jobs. So, when
using PySpark, the "real computation" is handled by a python process
(via daemon.py), not a java process.

Hope that helps,

On Mon, Oct 7, 2013 at 9:50 PM, Shangyu Luo <lsyurd@gmail.com> wrote:
> Hello!
> I am using the Python version of Spark 0.7.3. Recently, when I ran some Spark
> programs on a cluster, I found that some processes invoked by
> spark-0.7.3/python/pyspark/daemon.py would occupy the CPU for a long time and
> consume a lot of memory (e.g., 5g for each process). It seemed that the java
> process, which was invoked by
> java -cp
> :/usr/lib/spark-0.7.3/conf:/usr/lib/spark-0.7.3/core/target/scala-2.9.3/classes
> ...  , was 'competing' with the daemon.py for CPU resources. From my
> understanding, the java process should be responsible for the 'real'
> computation in spark.
> So I am wondering what work daemon.py actually does. Is it normal for it
> to consume a lot of CPU and memory?
> Thanks!
> Best,
> Shangyu Luo
> --
> Shangyu, Luo
> Department of Computer Science
> Rice University


Shangyu, Luo
Department of Computer Science
Rice University
