spark-user mailing list archives

From Ognen Duzlevski <>
Subject Parallelizing job execution
Date Fri, 21 Mar 2014 13:43:38 GMT

I have a task that runs on a week's worth of data (let's say) and 
produces a set of tuples such as Set[(String,Long)] (essentially the 
output of countByValue.toMap).

I want to produce 4 sets, one each for a different week and run an 
intersection of the 4 sets.

I have the sequential approach working, but the 4 weeks are obviously 
independent of each other in how they produce their sets (each works on 
its own data), so the same job that produces a set for one week could 
just be run as 4 jobs in parallel, each with a different week start date.

How is this done in Spark? Is it the runJob() method on SparkContext? 
Any example code anywhere?
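One common pattern (a sketch, not from this thread) is to submit the four actions from separate threads, since a single SparkContext can schedule concurrent jobs; Scala Futures are a convenient way to do this. In the sketch below, `weekSet` and the `weekStarts` dates are hypothetical stand-ins for the poster's per-week Spark job; in real code the function body would run countByValue on that week's RDD:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelWeeks {
  // Hypothetical stand-in for the per-week job. In a real application this
  // would filter the RDD to the given week and run countByValue, e.g.:
  //   rdd.filter(inWeek(weekStart)).countByValue().toMap.toSet
  def weekSet(weekStart: String): Set[(String, Long)] =
    Set((weekStart, 1L), ("shared", 2L))

  def main(args: Array[String]): Unit = {
    // Assumed week start dates, one per independent job.
    val weekStarts = Seq("2014-03-01", "2014-03-08", "2014-03-15", "2014-03-22")

    // Launch the four independent jobs concurrently. Spark actions are
    // blocking calls, but the scheduler accepts jobs from multiple threads,
    // so each Future can safely invoke an action on the shared SparkContext.
    val futures = weekStarts.map(w => Future(weekSet(w)))
    val sets = Await.result(Future.sequence(futures), Duration.Inf)

    // Intersect the four weekly sets to find tuples common to all weeks.
    val common = sets.reduce(_ intersect _)
    println(common)
  }
}
```

With the stub above, only the ("shared", 2L) tuple survives the intersection, since each week's date tuple is unique to that week. SparkContext.runJob is a lower-level hook and is usually not needed for this; plain concurrent action calls suffice.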

