spark-user mailing list archives

From Debabrata Ghosh <mailford...@gmail.com>
Subject Calling Pyspark functions in parallel
Date Mon, 19 Mar 2018 05:54:18 GMT
Hi,
             My dataframe has 2000 rows. Processing each row takes about
3 seconds, so running sequentially takes 2000 * 3 = 6000 seconds,
which is a very high time.

              I am therefore contemplating running the function in parallel.
For example, I would like to divide the rows of my dataframe into 4 sets of
500 rows each and call my PySpark function on each set in parallel. I wanted
to know if there is any library / PySpark function which I can leverage to
do this execution in parallel.
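To illustrate the splitting I have in mind, here is a minimal local sketch using only Python's standard library (process_row is a hypothetical stand-in for my 3-second function; within Spark itself I understand the equivalent would be something like repartitioning the dataframe into 4 partitions and mapping over partitions):

```python
from concurrent.futures import ThreadPoolExecutor

def process_row(row):
    # Stand-in for the real per-row work, which takes ~3 seconds each.
    return row * 2

def process_chunk(chunk):
    # Process one set of 500 rows sequentially within a worker.
    return [process_row(r) for r in chunk]

rows = list(range(2000))          # stand-in for the 2000 dataframe rows
n_workers = 4
chunk_size = len(rows) // n_workers   # 2000 / 4 = 500 rows per set

# Split the rows into 4 sets of 500.
chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

# Run the 4 sets concurrently; results come back in chunk order.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = [r for part in pool.map(process_chunk, chunks) for r in part]

print(len(results))  # 2000
```

(For CPU-bound work a process pool rather than a thread pool would be needed to get real speedup in plain Python; with Spark the cluster scheduler would handle that distribution.)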

               I would really appreciate your feedback at your earliest
convenience. Thanks,

Debu
