spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Ash <and...@andrewash.com>
Subject Re: .intersection() method on RDDs?
Date Fri, 24 Jan 2014 05:40:41 GMT
And I think the followup to Ian's question:

Is there a way to implement .intersect() in the core API that's more
efficient than the .join() method Evan suggested?

Andrew


On Thu, Jan 23, 2014 at 10:26 PM, Ian O'Connell <ian@ianoconnell.com> wrote:

> Is there any separation in the API between functions that can be built
> solely on the existing exposed public API and ones which require access to
> internals?
>
> Just to maybe avoid bloat for composite functions like this that are for
> user convenience?
>
> (Ala something like lua's aux api vs core api?)
>
>
> On Thu, Jan 23, 2014 at 8:33 PM, Matei Zaharia <matei.zaharia@gmail.com>wrote:
>
>> I’d be happy to see this added to the core API.
>>
>> Matei
>>
>> On Jan 23, 2014, at 5:39 PM, Andrew Ash <andrew@andrewash.com> wrote:
>>
>> Ah right of course -- perils of typing code without running it!
>>
>> It feels like this is a pretty core operation that should be added to the
>> main RDD API.  Do other people not run into this often?
>>
>> When I'm validating a foreign key join in my cluster I often check to
>> make sure that the foreign keys land on valid values on the referenced
>> table, and the way I do that is checking to see what percentage of the
>> references actually land.
>>
>>
>> On Thu, Jan 23, 2014 at 6:36 PM, Evan R. Sparks <evan.sparks@gmail.com>wrote:
>>
>>> Yup (well, with _._1 at the end!)
>>>
>>>
>>> On Thu, Jan 23, 2014 at 5:28 PM, Andrew Ash <andrew@andrewash.com>wrote:
>>>
>>>> You're thinking like this?
>>>>
>>>> A.map(v => (v,None)).join(B.map(v => (v,None))).map(_._2)
>>>>
>>>>
>>>> On Thu, Jan 23, 2014 at 6:26 PM, Evan R. Sparks <evan.sparks@gmail.com>wrote:
>>>>
>>>>> You could map each to an RDD[(String,None)] and do a join.
>>>>>
>>>>>
>>>>> On Thu, Jan 23, 2014 at 5:18 PM, Andrew Ash <andrew@andrewash.com>wrote:
>>>>>
>>>>>> Hi spark users,
>>>>>>
>>>>>> I recently wanted to calculate the set intersection of two RDDs of
>>>>>> Strings.  I couldn't find a .intersection() method in the autocomplete
or
>>>>>> in the Scala API docs, so used a little set theory to end up with
this:
>>>>>>
>>>>>> lazy val A = ...
>>>>>> lazy val B = ...
>>>>>> A.union(B).subtract(A.subtract(B)).subtract(B.subtract(A))
>>>>>>
>>>>>> Which feels very cumbersome.
>>>>>>
>>>>>> Does anyone have a more idiomatic way to calculate intersection?
>>>>>>
>>>>>> Thanks!
>>>>>> Andrew
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>

Mime
View raw message