spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: .intersection() method on RDDs?
Date Fri, 24 Jan 2014 04:33:10 GMT
I’d be happy to see this added to the core API.

Matei

On Jan 23, 2014, at 5:39 PM, Andrew Ash <andrew@andrewash.com> wrote:

> Ah right of course -- perils of typing code without running it!
> 
> It feels like this is a pretty core operation that should be added to the main RDD API.
 Do other people not run into this often?
> 
> When I'm validating a foreign key join in my cluster I often check to make sure that
the foreign keys land on valid values on the referenced table, and the way I do that is checking
to see what percentage of the references actually land.
> 
> 
> On Thu, Jan 23, 2014 at 6:36 PM, Evan R. Sparks <evan.sparks@gmail.com> wrote:
> Yup (well, with _._1 at the end!)
> 
> 
> On Thu, Jan 23, 2014 at 5:28 PM, Andrew Ash <andrew@andrewash.com> wrote:
> You're thinking like this?
> 
> A.map(v => (v,None)).join(B.map(v => (v,None))).map(_._2)
> 
> 
> On Thu, Jan 23, 2014 at 6:26 PM, Evan R. Sparks <evan.sparks@gmail.com> wrote:
> You could map each to an RDD[(String,None)] and do a join.
> 
> 
> On Thu, Jan 23, 2014 at 5:18 PM, Andrew Ash <andrew@andrewash.com> wrote:
> Hi spark users,
> 
> I recently wanted to calculate the set intersection of two RDDs of Strings.  I couldn't
find a .intersection() method in the autocomplete or in the Scala API docs, so used a little
set theory to end up with this:
> 
> lazy val A = ...
> lazy val B = ...
> A.union(B).subtract(A.subtract(B)).subtract(B.subtract(A))
> 
> Which feels very cumbersome.
> 
> Does anyone have a more idiomatic way to calculate intersection?
> 
> Thanks!
> Andrew
> 
> 
> 
> 


Mime
View raw message