spark-user mailing list archives

From Greg Hill <greg.h...@RACKSPACE.COM>
Subject Re: clarification for some spark on yarn configuration options
Date Tue, 23 Sep 2014 14:23:50 GMT
Thanks for looking into it.  I'm trying to avoid making the user pass in any parameters by
configuring it to use the right values for the cluster size by default, hence my reliance
on the configuration.  I'd rather just use spark-defaults.conf than the environment variables,
and looking at the code you modified, I don't see any place it's picking up spark.driver.memory
either.  Is that a separate bug?

Greg


From: Andrew Or <andrew@databricks.com>
Date: Monday, September 22, 2014 8:11 PM
To: Nishkam Ravi <nravi@cloudera.com>
Cc: Greg <greg.hill@rackspace.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

From browsing the code quickly I believe SPARK_DRIVER_MEMORY is not actually picked up
in cluster mode. This is a bug and I have opened a PR to fix it: https://github.com/apache/spark/pull/2500.
For now, please use --driver-memory instead, which should work for both client and cluster
mode.
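
For example, a submission along these lines (the class name and jar path below are just placeholders):

    ./bin/spark-submit \
      --master yarn-cluster \
      --driver-memory 2g \
      --executor-memory 2g \
      --class com.example.MyApp \
      /path/to/my-app.jar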

Thanks for pointing this out,
-Andrew

2014-09-22 14:04 GMT-07:00 Nishkam Ravi <nravi@cloudera.com>:
Maybe try --driver-memory if you are using spark-submit?

Thanks,
Nishkam

On Mon, Sep 22, 2014 at 1:41 PM, Greg Hill <greg.hill@rackspace.com> wrote:
Ah, I see.  It turns out that my problem is that the comparison is ignoring SPARK_DRIVER_MEMORY
and comparing against the default of 512m.  Is that a bug that's since been fixed?  I'm on 1.0.1
and using 'yarn-cluster' as the master.  'yarn-client' seems to pick up the values and works fine.

Greg

From: Nishkam Ravi <nravi@cloudera.com>
Date: Monday, September 22, 2014 3:30 PM
To: Greg <greg.hill@rackspace.com>
Cc: Andrew Or <andrew@databricks.com>, "user@spark.apache.org" <user@spark.apache.org>

Subject: Re: clarification for some spark on yarn configuration options

Greg, if you look carefully, the code is enforcing that the memoryOverhead be lower (and not
higher) than spark.driver.memory.
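
To put illustrative numbers on it (not taken from your setup): with

    spark.driver.memory               2g
    spark.yarn.driver.memoryOverhead  384

the check passes; it would only complain if the overhead were configured to be larger than the
2048 MB of driver memory.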

Thanks,
Nishkam

On Mon, Sep 22, 2014 at 1:26 PM, Greg Hill <greg.hill@rackspace.com> wrote:
I thought I had this all figured out, but I'm getting some weird errors now that I'm attempting
to deploy this on production-size servers.  It's complaining that I'm not allocating enough
memory to the memoryOverhead values.  I tracked it down to this code:

https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set spark.yarn.driver.memoryOverhead
to be higher than spark.driver.memory, but that makes no sense to me since that memory is
just supposed to be what YARN needs on top of what you're allocating for Spark.  My understanding
was that the overhead values should be quite a bit lower (and by default they are).

Also, why must the executor be allocated less memory than the driver's memory overhead value?

What am I misunderstanding here?

Greg

From: Andrew Or <andrew@databricks.com>
Date: Tuesday, September 9, 2014 5:49 PM
To: Greg <greg.hill@rackspace.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The equivalent "spark.executor.instances"
is just another way to set the same thing in your spark-defaults.conf. Maybe this should be
documented. :)
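
For example, these two are equivalent ways of asking for 10 executors (10 is just an illustrative
value):

    # in spark-env.sh
    export SPARK_EXECUTOR_INSTANCES=10

    # or in spark-defaults.conf
    spark.executor.instances  10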

"spark.yarn.executor.memoryOverhead" is just an additional margin added to "spark.executor.memory"
for the container. In addition to the executor's memory, the container in which the executor
is launched needs some extra memory for system processes, and this is what this "overhead"
(somewhat of a misnomer) is for. The same goes for the driver equivalent.
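
To illustrate with made-up numbers: with

    spark.executor.memory               2g
    spark.yarn.executor.memoryOverhead  384

each executor container requested from YARN is roughly 2048 MB + 384 MB = 2432 MB: the 2g becomes
the executor's JVM heap, and the 384 MB is the extra headroom the container gets for off-heap and
system usage.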

"spark.driver.memory" behaves differently depending on which version of Spark you are using.
If you are using Spark 1.1+ (this was released very recently), you can directly set "spark.driver.memory"
and this will take effect. Otherwise, setting this doesn't actually do anything for client
deploy mode, and you have two alternatives: (1) set the environment variable equivalent SPARK_DRIVER_MEMORY
in spark-env.sh, and (2) if you are using Spark submit (or bin/spark-shell, or bin/pyspark,
which go through bin/spark-submit), pass the "--driver-memory" command line argument.
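
Concretely, any one of these (4g here is just an example value):

    # Spark 1.1+: in spark-defaults.conf
    spark.driver.memory  4g

    # alternative (1): in spark-env.sh
    export SPARK_DRIVER_MEMORY=4g

    # alternative (2): on the spark-submit command line
    ./bin/spark-submit --driver-memory 4g ...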

If you want your PySpark application (driver) to pick up extra class path, you can pass the
"--driver-class-path" to Spark submit. If you are using Spark 1.1+, you may set "spark.driver.extraClassPath"
in your spark-defaults.conf. There is also an environment variable you could set (SPARK_CLASSPATH),
though this is now deprecated.
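
For example (the /etc/hadoop/conf path is only a placeholder for wherever your YARN/Hadoop client
configuration actually lives):

    # on the command line (bin/pyspark goes through spark-submit, so the same flag applies there)
    ./bin/spark-submit --driver-class-path /etc/hadoop/conf ...

    # Spark 1.1+: in spark-defaults.conf
    spark.driver.extraClassPath  /etc/hadoop/conf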

Let me know if you have more questions about these options,
-Andrew


2014-09-08 6:59 GMT-07:00 Greg Hill <greg.hill@rackspace.com>:
Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster or the workers per
slave node?

Is spark.executor.instances an actual config option?  I found that in a commit, but it's not
in the docs.

What is the difference between spark.yarn.executor.memoryOverhead and spark.executor.memory?
Same question for the 'driver' variant, but I assume it's the same answer.

Is there a spark.driver.memory option that's undocumented or do you have to use the environment
variable SPARK_DRIVER_MEMORY?

What config option or environment variable do I need to set to get pyspark interactive to
pick up the yarn class path?  The ones that work for spark-shell and spark-submit don't seem
to work for pyspark.

Thanks in advance.

Greg




