You’d have to add a filter after the cogroup too. Cogroup gives you (key, (list of values in RDD 1, list in RDD 2)).

Also, one small thing: instead of setting the value to None, it may be cheaper to use null.
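
A minimal sketch of the cogroup-plus-filter approach described above, assuming two RDDs of Strings. The object and method names (IntersectSketch, intersectByCogroup) are hypothetical and used only for illustration. The filter keeps only keys that have at least one value on both sides, and null is used as the placeholder value as suggested.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // brings pair-RDD implicits into scope on older Spark releases

object IntersectSketch {
  // Hypothetical helper, not part of the Spark API discussed in this thread.
  def intersectByCogroup(a: RDD[String], b: RDD[String]): RDD[String] =
    a.map(v => (v, null))                 // key by the element, null placeholder value
      .cogroup(b.map(v => (v, null)))     // (key, (values from a, values from b))
      .filter { case (_, (left, right)) => left.nonEmpty && right.nonEmpty }
      .keys                               // one record per key present in both RDDs
}
```

Without the filter, cogroup also emits keys that appear in only one of the two RDDs, with an empty collection on the other side, so taking .keys directly would give the union rather than the intersection.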


On Jan 23, 2014, at 10:30 PM, Andrew Ash wrote:

You mean cogroup like this? A.map(v => (v,None)).cogroup(B.map(v => (v,None))).keys

If so I might send a PR to start code review for getting this into master.

Good to know about the strategy for sharding RDDs and for the core operations.


On Thu, Jan 23, 2014 at 11:17 PM, Matei Zaharia wrote:
Using cogroup would probably be slightly more efficient than join because you don’t have to generate every pair of keys for elements that occur in each dataset multiple times.
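
To illustrate the point about duplicates, here is a rough sketch of the join-based variant, assuming two RDDs of Strings. The names JoinVsCogroup, intersectByJoin and joinedRecordCount are hypothetical, and Unit is used as the placeholder value here instead of the None used elsewhere in the thread. A key that occurs m times in one RDD and n times in the other yields m * n joined records, so the join version needs a distinct() at the end.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // brings pair-RDD implicits into scope on older Spark releases

object JoinVsCogroup {
  // Join-based intersection: duplicates on both sides multiply,
  // so the result has to be de-duplicated afterwards.
  def intersectByJoin(a: RDD[String], b: RDD[String]): RDD[String] =
    a.map(v => (v, ()))
      .join(b.map(v => (v, ())))
      .map(_._1)
      .distinct()

  // Number of records the join produces before de-duplication.
  def joinedRecordCount(a: RDD[String], b: RDD[String]): Long =
    a.map(v => (v, ())).join(b.map(v => (v, ()))).count()
}
```

With cogroup, each key shows up once with its values grouped per side, so there is no m * n blow-up to undo.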

We haven’t tried to explicitly separate the API between “core” methods and others, but in practice, everything can be built on mapPartitions and cogroup for transformations, and SparkContext.runJob (internal method) for actions. What really matters is actually the level at which the code sees dependencies in the DAGScheduler, which is done through the Dependency class. There are only two types of dependencies (narrow and shuffle), which correspond to those operations above. So in a sense there is this separation at the lowest level. But for the levels above, the goal was first and foremost to make the API as usable as possible, which meant giving people quick access to all the operations that might be useful, and dealing with how we’ll implement those later. Over time it will be possible to divide things like RDD.scala into multiple traits if they become unwieldy.


On Jan 23, 2014, at 9:40 PM, Andrew Ash wrote:

And I think the followup to Ian's question:

Is there a way to implement .intersect() in the core API that's more efficient than the .join() method Evan suggested?


On Thu, Jan 23, 2014 at 10:26 PM, Ian O'Connell wrote:
Is there any separation in the API between functions that can be built solely on the existing exposed public API and ones which require access to internals?

Just to maybe avoid bloat for composite functions like this that are for user convenience?

(À la Lua's aux API vs. core API?)

On Thu, Jan 23, 2014 at 8:33 PM, Matei Zaharia wrote:
I’d be happy to see this added to the core API.


On Jan 23, 2014, at 5:39 PM, Andrew Ash wrote:

Ah right of course -- perils of typing code without running it!

It feels like this is a pretty core operation that should be added to the main RDD API.  Do other people not run into this often?

When I'm validating a foreign key join in my cluster I often check to make sure that the foreign keys land on valid values on the referenced table, and the way I do that is checking to see what percentage of the references actually land.
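
One way such a check might look, as a sketch only: the names ForeignKeyCheck, fractionLanding, foreignKeys and primaryKeys are hypothetical, and both inputs are assumed to be RDDs of String keys. It counts the references whose key also appears in the referenced table and divides by the total number of references.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // brings pair-RDD implicits into scope on older Spark releases

object ForeignKeyCheck {
  // Fraction of foreign-key references that match a key in the referenced table.
  def fractionLanding(foreignKeys: RDD[String], primaryKeys: RDD[String]): Double = {
    val total = foreignKeys.count()
    val landed = foreignKeys.map(k => (k, null))
      .cogroup(primaryKeys.map(k => (k, null)))
      .filter { case (_, (_, pks)) => pks.nonEmpty }      // key exists in referenced table
      .map { case (_, (fks, _)) => fks.size.toLong }      // count every reference to that key
      .fold(0L)(_ + _)
    // An empty set of references is treated as fully valid (an assumption of this sketch).
    if (total == 0L) 1.0 else landed.toDouble / total
  }
}
```

Note that foreignKeys is evaluated twice here (once for the count and once for the cogroup), so caching it first may be worthwhile on large inputs.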

On Thu, Jan 23, 2014 at 6:36 PM, Evan R. Sparks wrote:
Yup (well, with _._1 at the end!)

On Thu, Jan 23, 2014 at 5:28 PM, Andrew Ash wrote:
You're thinking like this? A.map(v => (v,None)).join(B.map(v => (v,None))).map(_._2)

On Thu, Jan 23, 2014 at 6:26 PM, Evan R. Sparks wrote:
You could map each to an RDD[(String,None)] and do a join.

On Thu, Jan 23, 2014 at 5:18 PM, Andrew Ash wrote:
Hi spark users,

I recently wanted to calculate the set intersection of two RDDs of Strings.  I couldn't find a .intersection() method in the autocomplete or in the Scala API docs, so I used a little set theory to end up with this:

lazy val A = ...
lazy val B = ...

Which feels very cumbersome.

Does anyone have a more idiomatic way to calculate intersection?