spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: [PySpark] Releasing memory after a spark job is finished
Date Tue, 05 Jun 2018 05:15:51 GMT
Additionally, what I meant by modularization is that jobs that have really nothing to do with each other should be in separate Python programs.
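For example (hypothetical job names and paths), two unrelated jobs would each live in their own file and be submitted on their own:

    # sales_report.py -- one self-contained Spark application
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sales_report").getOrCreate()
    spark.read.parquet("/data/sales") \
         .groupBy("region").sum("amount") \
         .write.mode("overwrite").parquet("/reports/sales_by_region")
    spark.stop()

    # log_cleanup.py -- a job that has nothing to do with the one above
    # goes into its own file, and each one is submitted independently:
    #   spark-submit sales_report.py
    #   spark-submit log_cleanup.py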

> On 5. Jun 2018, at 04:50, Thakrar, Jayesh <jthakrar@conversantmedia.com> wrote:
> 
> Disclaimer - I use Spark with Scala and not Python.
>  
> But I am guessing that Jörn's reference to modularization is to ensure that you do the processing inside methods/functions and call those methods sequentially.
> I believe that as long as an RDD/Dataset variable is in scope, its memory may not get released.
> By putting the processing in functions, the variables go out of scope and their memory can be released.
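> A rough sketch of that pattern, just to illustrate the idea (hypothetical paths and column names):
> 
>     from pyspark.sql import SparkSession
> 
>     def process_dataset(spark, in_path, out_path):
>         # all DataFrame references live only inside this function
>         df = spark.read.parquet(in_path)
>         result = df.filter(df["status"] == "active").groupBy("key").count()
>         result.write.mode("overwrite").parquet(out_path)
>         # when the function returns, df and result go out of scope,
>         # so the driver no longer holds references to them
> 
>     spark = SparkSession.builder.getOrCreate()
>     process_dataset(spark, "/data/in/ds1", "/data/out/ds1")
>     process_dataset(spark, "/data/in/ds2", "/data/out/ds2")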
>  
> Also, this assumes that the variables are not daisy-chained/inter-related, as that too would make it harder for the memory to be released.
>  
>  
> From: Jay <jayadeep.jayaraman@gmail.com>
> Date: Monday, June 4, 2018 at 9:41 PM
> To: Shuporno Choudhury <shuporno.choudhury@gmail.com>
> Cc: "Jörn Franke [via Apache Spark User List]" <ml+s1001560n32458h84@n3.nabble.com>,
<user@spark.apache.org>
> Subject: Re: [PySpark] Releasing memory after a spark job is finished
>  
> Can you tell us what version of Spark you are using and whether dynamic allocation is enabled?
>  
> Also, how are the files being read? Is it a single read of all files using a file-matching regex, or are you running different threads in the same PySpark job?
>  
>  
> 
> On Mon 4 Jun, 2018, 1:27 PM Shuporno Choudhury, <shuporno.choudhury@gmail.com> wrote:
> Thanks a lot for the insight.
> Actually, I have the exact same transformations for all the datasets, hence only one Python program.
> Now, do you suggest that I run a separate spark-submit for each of the different datasets, given that the transformations are exactly the same?
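> In other words, something roughly like this (hypothetical script name and paths), submitted once per dataset:
> 
>     # per_dataset_job.py -- the same transformations, parameterised by dataset,
>     # submitted separately for each one, e.g.:
>     #   spark-submit per_dataset_job.py /data/in/ds1 /data/out/ds1
>     #   spark-submit per_dataset_job.py /data/in/ds2 /data/out/ds2
>     import sys
>     from pyspark.sql import SparkSession
> 
>     if __name__ == "__main__":
>         in_path, out_path = sys.argv[1], sys.argv[2]
>         spark = SparkSession.builder.appName("per-dataset-job").getOrCreate()
>         df = spark.read.parquet(in_path)
>         df.dropDuplicates().write.mode("overwrite").parquet(out_path)  # stand-in for the real transformations
>         spark.stop()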
>  
> On Tue 5 Jun, 2018, 1:48 AM Jörn Franke [via Apache Spark User List], <ml+s1001560n32458h84@n3.nabble.com> wrote:
> Yes, if they are independent and have different transformations, then I would create a separate Python program for each. Especially with big data processing frameworks, one should avoid putting everything into one big monolithic application.
>  
> 
> On 4. Jun 2018, at 22:02, Shuporno Choudhury <[hidden email]> wrote:
> 
> Hi,
>  
> Thanks for the input.
> I was trying to get the functionality working first, hence I was using local mode. I will definitely be running on a cluster, but later.
>  
> Sorry for my naivety, but can you please elaborate on the modularity concept that you mentioned and how it will affect what I am already doing?
> Do you mean running a different spark-submit for each dataset when you say 'an independent python program for each process'?
>  
> On Tue, 5 Jun 2018 at 01:12, Jörn Franke [via Apache Spark User List] <[hidden email]> wrote:
> Why don’t you modularize your code and write an independent Python program for each process, each submitted via Spark?
>  
> Not sure though if Spark local mode makes sense. If you don’t have a cluster, then a normal Python program can be much better.
> 
> On 4. Jun 2018, at 21:37, Shuporno Choudhury <[hidden email]> wrote:
> 
> Hi everyone,
> I am trying to run a PySpark job on some data sets sequentially [basically: 1. read data into a dataframe, 2. perform some join/filter/aggregation, 3. write the modified data in Parquet format to a target location].
> Now, while running this PySpark code across multiple independent data sets sequentially, the memory used for the previous data set doesn't seem to get released/cleared, and hence Spark's memory consumption (the JVM memory usage shown in Task Manager) keeps increasing until the job fails on some data set.
> So, is there a way to clear/remove dataframes that I know are not going to be used later?

> Basically, can I clear out some memory programmatically (in the PySpark code) when processing for a particular data set ends?
> At no point am I caching any dataframe (so unpersist() is also not a solution).
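> For instance, would something along these lines even make a difference (hypothetical paths; the filter/groupBy is just a stand-in for my actual transformations)?
> 
>     import gc
>     from pyspark.sql import SparkSession
> 
>     spark = SparkSession.builder.master("local[*]").getOrCreate()
> 
>     df = spark.read.parquet("/data/in/ds1")
>     result = df.filter(df["status"] == "active").groupBy("key").count()
>     result.write.mode("overwrite").parquet("/data/out/ds1")
> 
>     del df, result   # drop the driver-side references once this dataset is done
>     gc.collect()     # ask Python to collect them -- but does this free Spark's memory?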
>  
> I am running Spark using local[*] as master. There is a single SparkSession that is doing all the processing.
> If it is not possible to clear out memory, what can be a better approach for this problem?
>  
> Can someone please help me with this and tell me if I am going wrong anywhere?
>  
> --Thanks,
> Shuporno Choudhury
>  
> 
>  
> --
> --Thanks,
> Shuporno Choudhury
>  
> 
