spark-user mailing list archives

From Akhil Das <>
Subject Re: spark.python.worker.memory Discontinuity
Date Mon, 02 Nov 2015 07:11:38 GMT
You can look at the code base itself: the _memory_limit function returns the
amount of memory you set with spark.python.worker.memory, and it is used for
groupBy and similar operations.
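The behavior described above, a per-worker memory limit that triggers spilling to disk during aggregation, can be sketched roughly as follows. This is a simplified, self-contained stand-in, not the actual PySpark spill code; the class name, the pickle-based size accounting, and the spill-file format are all illustrative assumptions.

```python
# Simplified illustration of how a Python worker's aggregator can spill
# partial groupBy results to disk once in-memory data exceeds a limit,
# which is the role spark.python.worker.memory plays in PySpark.
import os
import pickle
import tempfile
from collections import defaultdict


class SpillingAggregator:
    """Groups values by key, spilling the in-memory map to disk whenever
    the (crudely estimated) memory used exceeds memory_limit_bytes."""

    def __init__(self, memory_limit_bytes):
        self.memory_limit = memory_limit_bytes
        self.data = defaultdict(list)
        self.used = 0
        self.spill_files = []

    def insert(self, key, value):
        self.data[key].append(value)
        # Crude size estimate; real code uses finer-grained accounting
        # and only samples object sizes periodically.
        self.used += len(pickle.dumps((key, value)))
        if self.used > self.memory_limit:
            self._spill()

    def _spill(self):
        # Write the current in-memory map to a temp file, then reset it.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(dict(self.data), f)
        self.spill_files.append(path)
        self.data = defaultdict(list)
        self.used = 0

    def items(self):
        # Merge all spilled maps with whatever is still in memory.
        merged = defaultdict(list)
        for path in self.spill_files:
            with open(path, "rb") as f:
                for k, vs in pickle.load(f).items():
                    merged[k].extend(vs)
            os.remove(path)
        for k, vs in self.data.items():
            merged[k].extend(vs)
        return merged.items()
```

A discontinuity in run time is consistent with this design: below the threshold the aggregation spills to disk repeatedly, and above it everything stays in memory, so crossing the limit changes the amount of disk I/O abruptly rather than gradually.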

Best Regards

On Fri, Oct 23, 2015 at 11:46 PM, Connor Zanin <> wrote:

> Hi all,
> I am running a simple word count job on a cluster of 4 nodes (24 cores per
> node). I am varying two parameters in the configuration,
> spark.python.worker.memory and the number of partitions in the RDD. My job
> is written in Python.
> I am observing a discontinuity in the run time of the job when the
> spark.python.worker.memory is increased past a threshold. Unfortunately, I
> am having trouble understanding exactly what this parameter is doing to
> Spark internally and how it changes Spark's behavior to create this
> discontinuity.
> The documentation describes this parameter as "Amount of memory to use
> per python worker process during aggregation," but I find this vague (or
> I do not know enough Spark terminology to understand what it means).
> I have been pointed to the source code in the past, specifically the
> file where _spill() appears.
> Can anyone explain how this parameter behaves or point me to more
> descriptive documentation? Thanks!
> --
> Regards,
> Connor Zanin
> Computer Science
> University of Delaware
