spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From lonikar <loni...@gmail.com>
Subject Re: Rdd of Rdds
Date Tue, 09 Jun 2015 08:32:22 GMT
Replicating my answer to another question asked today:

Here is one of the reasons why I think RDD[RDD[T]] is not possible:
   * RDD is only a handle to the actual data partitions. It has a
reference/pointer to the /SparkContext /object (/sc/) and a list of
partitions.
   * The SparkContext is an object in the Spark Application/Driver Program's
JVM. Similarly, the list of partitions is also in the JVM of the driver
program. Each partition contains kind of "remote references" to the
partition data on the worker JVMs.
   * The functions passed to RDD's transformations and actions execute in
the worker's JVMs on different nodes. For example, in "*rdd.map { x => x*x
}*", the function performing "*x*x*" runs on the JVMs of the worker nodes
where the partitions of the RDD reside. These JVMs do not have access to the
"sc" since its only on the driver's JVM.
   * Thus, in case of your RDD of RDD: *outerRDD.map { innerRdd =>
innerRDD.filter { x => x*x } }*, the worker nodes will not be able to
execute the filter on innerRDD as the code in the worker does not have
access to "sc" and can not launch a spark job.

Hope it helps. You need to consider List[RDD] or some other collection.

Possibly in future, if and when spark architecture allows workers to launch
spark jobs (the functions passed to transformation or action APIs of RDD),
it will be possible to have RDD of RDD.





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-tp17025p23217.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message