spark-user mailing list archives

From Ognen Duzlevski <og...@nengoiksvelzud.com>
Subject Parallelizing job execution
Date Fri, 21 Mar 2014 13:43:38 GMT
Hello,

I have a task that runs on (say) a week's worth of data and produces a Set
of tuples such as Set[(String, Long)] (essentially the output of
countByValue.toMap).

I want to produce 4 such sets, one for each of 4 different weeks, and then
run an intersection of the 4 sets.
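
For concreteness, the sequential version looks roughly like this (the input
path and the setForWeek name are just placeholders for my real job, sc is an
existing SparkContext, and I use .toSet rather than .toMap to get the tuples):

    import org.apache.spark.SparkContext

    // Placeholder for my real per-week job: read one week's data and produce
    // the Set[(String, Long)] of value counts. countByValue gives a Map, so
    // .toSet yields the tuples.
    def setForWeek(sc: SparkContext, weekStart: String): Set[(String, Long)] =
      sc.textFile(s"hdfs:///events/$weekStart/*")
        .countByValue()
        .toSet

    // Sequential version: one week after another, then intersect the 4 sets.
    val weeks = Seq("2014-02-24", "2014-03-03", "2014-03-10", "2014-03-17")
    val result = weeks.map(setForWeek(sc, _)).reduce(_ intersect _)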

I have the sequential approach working, but the 4 weeks are obviously
independent of each other in how they produce their sets (each works on its
own data), so the same job that produces a Set for one week could just be run
as 4 jobs in parallel, each with a different week start date.
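
In other words, something along these lines is what I imagine, reusing the
setForWeek and weeks from the sketch above and assuming it is safe to submit
jobs to the same SparkContext from several threads (which is exactly the part
I am unsure about):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Wrap each week's computation in a Future so the 4 Spark jobs are
    // submitted concurrently instead of one after another.
    val futures = weeks.map(ws => Future { setForWeek(sc, ws) })
    val sets = Await.result(Future.sequence(futures), Duration.Inf)
    val result = sets.reduce(_ intersect _)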

How is this done in Spark? Is it the runJob() method on SparkContext? 
Any example code anywhere?

Thanks!
Ognen

