spark-user mailing list archives

From: Sean Owen <so...@cloudera.com>
Subject: Re: Queue independent jobs
Date: Fri, 09 Jan 2015 11:42:15 GMT
You can parallelize on the driver side. The way to do it is almost
exactly what you have here, where you're iterating over a local Scala
collection of dates and invoking a Spark operation for each. Simply
write "dateList.par.map(...)" to make the local map proceed in
parallel. It should invoke the Spark jobs simultaneously.
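
A minimal runnable sketch of that approach (the dates and paths are
placeholders, and textFile/saveAsTextFile stand in for the pseudocode's
hdfsFile/saveAsHadoopFile):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("parallel-jobs"))
    val dateList = Seq("2015-01-07", "2015-01-08", "2015-01-09")

    // .par turns the local Seq into a parallel collection, so each
    // iteration runs on its own driver thread and the Spark jobs are
    // submitted concurrently. SparkContext is thread-safe for this.
    dateList.par.foreach { date =>
      sc.textFile(s"hdfs:///input/$date")        // placeholder input path
        .map(_.toUpperCase)                      // placeholder transform
        .saveAsTextFile(s"hdfs:///output/$date") // placeholder output path
    }

By default the parallel collection uses a thread pool sized to the number
of driver cores, and whether the jobs actually overlap on the cluster
depends on executor slots being free when they are submitted.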

On Fri, Jan 9, 2015 at 10:46 AM, Anders Arpteg <arpteg@spotify.com> wrote:
> Hey,
>
> Let's say we have multiple independent jobs that each transform some data and
> store it in distinct HDFS locations. Is there a nice way to run them in
> parallel? See the following pseudocode snippet:
>
> dateList.map(date =>
>   sc.hdfsFile(date).map(transform).saveAsHadoopFile(date))
>
> It's unfortunate if they run in sequence, since all the executors are not
> used efficiently. What's the best way to parallelize execution of these
> jobs?
>
> Thanks,
> Anders
