spark-user mailing list archives

From Victor Tso-Guillen <v...@paxata.com>
Subject Re: Parallelize independent tasks
Date Tue, 02 Dec 2014 17:08:35 GMT
dirs.par.foreach { case (src, dest) =>
  sc.textFile(src).process.saveAsFile(dest)
}

Is that sufficient for you?
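For reference, a minimal sketch of this `.par` approach using Scala 2.11-era parallel collections. Here `.process` is a stand-in for whatever transformation chain the job applies (it is not a real RDD method), and the save call is written as Spark's actual `saveAsTextFile` action; the pool size of 4 is an arbitrary example:

```scala
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

val parDirs = dirs.par
// Cap how many Spark jobs are submitted concurrently; by default a parallel
// collection uses one thread per driver core, which may be more than you want.
parDirs.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))

parDirs.foreach { case (src, dest) =>
  // .process stands in for your transformations; each iteration submits an
  // independent Spark job from its own driver thread.
  sc.textFile(src).process.saveAsTextFile(dest)
}
```

Jobs submitted concurrently like this share the executors; setting `spark.scheduler.mode=FAIR` lets them share cluster slots instead of queueing FIFO, which helps when some directories need only one partition.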

On Tuesday, December 2, 2014, Anselme Vignon <anselme.vignon@flaminem.com>
wrote:

> Hi folks,
>
>
> We have written a Spark job that scans multiple HDFS directories and
> performs transformations on them.
>
> For now, this is done with a simple for loop that starts one task at
> each iteration. This looks like:
>
> dirs.foreach { case (src, dest) =>
>   sc.textFile(src).process.saveAsFile(dest)
> }
>
>
> However, each iteration is independent, and we would like to optimize
> this by running the iterations with Spark simultaneously (or in a
> chained fashion), so that we don't have idle executors at the end of
> each iteration (some directories sometimes require only one partition).
>
>
> Has anyone already done such a thing? How would you suggest we could do
> that?
>
> Cheers,
>
> Anselme
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>
