spark-user mailing list archives

From RK Aduri <>
Subject MultiThreading in Spark 1.6.0
Date Wed, 20 Jul 2016 18:32:01 GMT
Spark version: 1.6.0 
So, here is the background:

	I have a data frame (Large_Row_DataFrame) that I created from an
array of row objects, and another array of unique ids (U_ID) that I use
to look up rows in Large_Row_DataFrame (which is cached) and apply a
custom function.
       For each unique id lookup, I do a collect on the cached dataframe
Large_Row_DataFrame. This means there are a bunch of ‘collect’ actions
that Spark has to run. Since I’m executing this in a loop over the unique
ids (U_ID), all of these collect actions run sequentially.

Solution that I implemented:

To avoid waiting on each collect in sequence, I split the unique ids into
a few subsets of a fixed size and run one thread per subset. Each thread
runs the collects as Spark jobs, in sequence, but only for its own subset.
There are as many threads as subsets. Surprisingly, the resulting run time
is better than with the earlier sequential approach.
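
The scheme described above can be sketched roughly as follows. This is an
illustration in Python only, not the actual code: the names `lookup`,
`chunk`, `run_subset` and `run_all`, the subset size, and the `LARGE_ROWS`
dictionary are all hypothetical. `lookup` stands in for the per-id
filter-and-collect on the cached Large_Row_DataFrame so that the sketch runs
without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the cached DataFrame: unique id -> rows.
LARGE_ROWS = {i: [("row", i)] for i in range(10)}


def lookup(u_id):
    """Stand-in for one Spark action (assumption): in the real job this
    would filter the cached DataFrame on u_id and call collect()."""
    return LARGE_ROWS.get(u_id, [])


def chunk(ids, size):
    """Split ids into consecutive subsets of at most `size` elements."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def run_subset(subset):
    """Run the lookups for one subset sequentially, inside one thread."""
    return {u_id: lookup(u_id) for u_id in subset}


def run_all(ids, size):
    """Start one thread per subset and merge the per-subset results."""
    subsets = chunk(ids, size)
    results = {}
    # max(1, ...) because ThreadPoolExecutor rejects max_workers=0.
    with ThreadPoolExecutor(max_workers=max(1, len(subsets))) as pool:
        for partial in pool.map(run_subset, subsets):
            results.update(partial)
    return results
```

For example, `run_all(list(range(10)), 3)` splits the ten ids into four
subsets and runs four threads. Spark's scheduler accepts jobs submitted from
separate threads of the same SparkContext, which is what allows the
per-subset collects to overlap.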

Now the question:

	Is multithreading a correct approach to this problem? Or is there a
better way of doing this?
