spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Imran Rashid (JIRA)" <>
Subject [jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs
Date Fri, 16 Jan 2015 00:44:34 GMT


Imran Rashid commented on SPARK-3622:

In some ways this kinda reminds of the problem w/ accumulators and lazy transformations. 
Accumulators are basically multiple output, but Spark itself provides no way to track when
that output is ready.  Its up to the developer to figure it out.

If you do a transformation on {{rddA}} you've got to know to "wait" until you've also got
a transformation on {{rddB}} ready as well.  Probably the simplest case for this is filtering
records by some condition, but keeping both the good and bad records, ala scala collection's
{{partition}} method.  I think this has come up on the user mailing list a few times.

What about having some new type {{MultiRDD}}, which only runs when you've queued up an action
on *all* RDDs?  eg. something like:

val input: RDD[String] = ...
val goodAndBad: MultiRdd[String, String] = input.partition{ str => MyRecordParser.isOk(str)}
val bad: RDD[String] = goodAndBad.get(1)
bad.saveAsTextFile(...) // doesn't do anything yet
val parsed: RDD[MyCaseClass] = goodAndBad.get(0).map{str => MyRecordParser.parse(str)}
val tmp: RDD[MyCaseClass] ={f1}.filter{f2}.mapPartitions{f3} //still don't do anything
val result = tmp.reduce{reduceFunc} // now everything gets run

> Provide a custom transformation that can output multiple RDDs
> -------------------------------------------------------------
>                 Key: SPARK-3622
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
> All existing transformations return just one RDD at most, even for those which takes
user-supplied functions such as mapPartitions() . However, sometimes a user provided function
may need to output multiple RDDs. For instance, a filter function that divides the input RDD
into serveral RDDs. While it's possible to get multiple RDDs by transforming the same RDD
multiple times, it may be more efficient to do this concurrently in one shot. Especially user's
existing function is already generating different data sets.
> This the case in Hive on Spark, where Hive's map function and reduce function can output
different data sets to be consumed by subsequent stages.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message