spark-user mailing list archives

From Jörn Franke <>
Subject Re: [PySpark] Releasing memory after a spark job is finished
Date Mon, 04 Jun 2018 20:18:27 GMT
Yes, if they are independent with different transformations, then I would create a separate Python
program for each. Especially with big data processing frameworks, one should avoid putting
everything into one big monolithic application.
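A minimal sketch of that modular approach (script and path names here are illustrative placeholders, not from the thread): a small driver builds and launches one `spark-submit` per data set, so each run gets its own driver JVM whose memory is fully returned to the OS when the process exits, rather than accumulating across data sets in a single long-lived session.

```python
import subprocess

def build_submit_cmd(job_script, dataset_path, output_path):
    """Build the spark-submit argv for processing one data set.

    job_script is a per-data-set PySpark program (read -> join/filter/
    aggregate -> write parquet); the name and paths are hypothetical.
    """
    return [
        "spark-submit",
        "--master", "local[*]",  # or a cluster master URL later
        job_script,
        "--input", dataset_path,
        "--output", output_path,
    ]

def submit_sequentially(datasets, job_script="process_one_dataset.py"):
    """Run one independent Spark job per (input, output) pair.

    Each subprocess.run starts a fresh driver JVM; when that process
    exits, all of its memory is released before the next data set starts.
    """
    for src, dst in datasets:
        subprocess.run(build_submit_cmd(job_script, src, dst), check=True)

# Example usage (requires spark-submit on PATH, so not executed here):
# submit_sequentially([("/data/in/ds1", "/data/out/ds1"),
#                      ("/data/in/ds2", "/data/out/ds2")])
```

The per-process isolation is the point: no programmatic cleanup inside one SparkSession is needed, because the operating system reclaims everything at process exit.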

> On 4. Jun 2018, at 22:02, Shuporno Choudhury <> wrote:
> Hi,
> Thanks for the input.
> I was trying to get the functionality working first, hence I was using local mode. I will
definitely be running on a cluster, but later.
> Sorry for my naivety, but can you please elaborate on the modularity concept you mentioned
and how it affects what I am already doing?
> Do you mean running a different spark-submit for each data set when you say 'an independent
python program for each process'?
>> On Tue, 5 Jun 2018 at 01:12, Jörn Franke [via Apache Spark User List] <> wrote:
>> Why don’t you modularize your code and write an independent Python program for each process
that is submitted via Spark?
>> Not sure though if Spark local makes sense. If you don’t have a cluster, then a normal
Python program can be much better.
>>> On 4. Jun 2018, at 21:37, Shuporno Choudhury <[hidden email]> wrote:
>>> Hi everyone,
>>> I am trying to run a pyspark job on some data sets sequentially [basically 1. Read
data into a dataframe 2. Perform some join/filter/aggregation 3. Write the modified data in
parquet format to a target location].
>>> Now, while running this pyspark code across multiple independent data sets sequentially,
the memory used for the previous data set doesn't seem to get released/cleared, so Spark's
memory consumption (the JVM's memory usage in Task Manager) keeps increasing until the job
fails on some data set.
>>> So, is there a way to clear/remove dataframes that I know are not going to be
used later? 
>>> Basically, can I clear out some memory programmatically (in the pyspark code)
when processing for a particular data set ends?
>>> At no point am I caching any dataframe (so unpersist() is also not a solution).
>>> I am running spark using local[*] as master. There is a single SparkSession that
is doing all the processing.
>>> If it is not possible to clear out memory, what can be a better approach for
this problem?
>>> Can someone please help me with this and tell me if I am going wrong anywhere?
>>> --Thanks,
>>> Shuporno Choudhury
> -- 
> --Thanks,
> Shuporno Choudhury
