spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: [PySpark] Releasing memory after a spark job is finished
Date Mon, 04 Jun 2018 19:41:41 GMT
Why don’t you modularize your code and write an independent Python program for each process,
each one submitted via Spark?

Not sure, though, whether Spark in local mode makes sense. If you don’t have a cluster, a normal
Python program can be much better.
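
For illustration, a minimal sketch of that per-data-set approach (the script name, paths and the
transformation are placeholders, not from this thread); each run gets its own driver JVM, which is
torn down when the script finishes, so its memory is released before the next data set is processed:

# process_one_dataset.py -- hypothetical standalone job, launched once per data set, e.g.
#   spark-submit --master local[*] process_one_dataset.py /path/to/input /path/to/output
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    input_path, output_path = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("per-dataset-job").getOrCreate()

    df = spark.read.parquet(input_path)                   # 1. read data into a dataframe
    result = df.filter(df["value"] > 0)                   # 2. placeholder join/filter/aggregation
    result.write.mode("overwrite").parquet(output_path)   # 3. write Parquet to the target location

    # Stopping the session shuts down the driver JVM for this run,
    # so nothing accumulates across data sets.
    spark.stop()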

> On 4. Jun 2018, at 21:37, Shuporno Choudhury <shuporno.choudhury@gmail.com> wrote:
> 
> Hi everyone,
> I am trying to run PySpark code on some data sets sequentially [basically: 1. read data
> into a dataframe, 2. perform some join/filter/aggregation, 3. write the modified data in
> Parquet format to a target location].
> Now, while running this PySpark code across multiple independent data sets sequentially,
> the memory usage from the previous data set doesn't seem to get released/cleared, and hence
> Spark's memory consumption (JVM memory consumption as seen in Task Manager) keeps increasing
> until it fails on some data set.
> So, is there a way to clear/remove dataframes that I know are not going to be used later?
> Basically, can I clear out some memory programmatically (in the PySpark code) when the
> processing for a particular data set ends?
> At no point am I caching any dataframe (so unpersist() is also not a solution).
> 
> I am running Spark using local[*] as master. There is a single SparkSession that is doing
> all the processing.
> If it is not possible to clear out memory, what would be a better approach to this problem?
> 
> Can someone please help me with this and tell me if I am going wrong anywhere?
> 
> --Thanks,
> Shuporno Choudhury
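
For reference, a rough sketch of the sequential single-session pattern described in the quoted
message (paths, the column name and the transformation are illustrative only, not taken from the
original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sequential-datasets").getOrCreate()

# Placeholder list of independent data sets processed one after another.
datasets = [("/data/in/ds1", "/data/out/ds1"),
            ("/data/in/ds2", "/data/out/ds2")]

for input_path, output_path in datasets:
    df = spark.read.parquet(input_path)                   # 1. read data into a dataframe
    result = df.groupBy("key").count()                    # 2. placeholder join/filter/aggregation
    result.write.mode("overwrite").parquet(output_path)   # 3. write Parquet to the target location

    # Nothing is cached, so unpersist() has nothing to release; dropping the Python
    # references is the only explicit cleanup available inside this one SparkSession.
    del df, result

spark.stop()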
