spark-user mailing list archives

From Anders Arpteg <arp...@spotify.com>
Subject Re: Queue independent jobs
Date Fri, 09 Jan 2015 11:45:58 GMT
Awesome, it actually seems to work. Amazing how simple it can be
sometimes...

Thanks Sean!

On Fri, Jan 9, 2015 at 12:42 PM, Sean Owen <sowen@cloudera.com> wrote:

> You can parallelize on the driver side. The way to do it is almost
> exactly what you have here, where you're iterating over a local Scala
> collection of dates and invoking a Spark operation for each. Simply
> write "dateList.par.map(...)" to make the local map proceed in
> parallel. It should invoke the Spark jobs simultaneously.
>
> On Fri, Jan 9, 2015 at 10:46 AM, Anders Arpteg <arpteg@spotify.com> wrote:
> > Hey,
> >
> > Let's say we have multiple independent jobs that each transform some data
> > and store it in distinct HDFS locations. Is there a nice way to run them in
> > parallel? See the following pseudo code snippet:
> >
> > dateList.map(date =>
> > sc.hdfsFile(date).map(transform).saveAsHadoopFile(date))
> >
> > It's unfortunate if they run in sequence, since not all the executors are
> > used efficiently. What's the best way to parallelize execution of these
> > jobs?
> >
> > Thanks,
> > Anders
>
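
For readers of the archive, here is a minimal sketch of the suggestion above, not the code either poster actually ran. The pseudo code's sc.hdfsFile and saveAsHadoopFile are replaced with sc.textFile and saveAsTextFile for simplicity; the dates, the hdfs:///input and hdfs:///output paths, and the body of transform are hypothetical placeholders. It assumes the application is launched with spark-submit, which supplies the master URL. On Scala 2.13 and later, .par lives in the separate scala-parallel-collections module and also needs "import scala.collection.parallel.CollectionConverters._".

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelJobs {
  // Hypothetical stand-in for the `transform` function in the question.
  def transform(line: String): String = line.toUpperCase

  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit.
    val conf = new SparkConf().setAppName("parallel-jobs")
    val sc = new SparkContext(conf)

    // Hypothetical local collection of dates, one independent job per date.
    val dateList = Seq("2015-01-07", "2015-01-08", "2015-01-09")

    // .par makes the local iteration run on several driver threads, so the
    // save actions are submitted to the same SparkContext concurrently.
    dateList.par.foreach { date =>
      sc.textFile(s"hdfs:///input/$date")       // hypothetical input layout
        .map(transform)
        .saveAsTextFile(s"hdfs:///output/$date") // hypothetical output layout
    }

    sc.stop()
  }
}
```

How many jobs actually overlap depends on the thread pool behind the parallel collection and on free executor capacity. With the default FIFO scheduler, later jobs only start early when the earlier ones leave resources idle; setting spark.scheduler.mode to FAIR is one way to share the cluster more evenly between them.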
