spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Ash <and...@andrewash.com>
Subject Re: .intersection() method on RDDs?
Date Fri, 24 Jan 2014 06:30:37 GMT
You mean cogroup like this?

A.map(v => (v,None)).cogroup(B.map(v => (v,None))).keys

If so I might send a PR to start code review for getting this into master.

Good to know about the strategy for sharding RDDs and for the core
operations.

Thanks!
Andrew


On Thu, Jan 23, 2014 at 11:17 PM, Matei Zaharia <matei.zaharia@gmail.com>wrote:

> Using cogroup would probably be slightly more efficient than join because
> you don’t have to generate every pair of keys for elements that occur in
> each dataset multiple times.
>
> We haven’t tried to explicitly separate the API between “core” methods and
> others, but in practice, everything can be built on mapPartitions and
> cogroup for transformations, and SparkContext.runJob (internal method) for
> actions. What really matters is actually the level at which the code sees
> dependencies in the DAGScheduler, which is done through the Dependency
> class. There are only two types of dependencies (narrow and shuffle), which
> correspond to those operations above. So in a sense there is this
> separation at the lowest level. But for the levels above, the goal was
> first and foremost to make the API as usable as possible, which meant
> giving people quick access to all the operations that might be useful, and
> dealing with how we’ll implement those later. Over time it will be possible
> to divide things like RDD.scala into multiple traits if they become
> unwieldy.
>
> Matei
>
>
> On Jan 23, 2014, at 9:40 PM, Andrew Ash <andrew@andrewash.com> wrote:
>
> And I think the followup to Ian's question:
>
> Is there a way to implement .intersect() in the core API that's more
> efficient than the .join() method Evan suggested?
>
> Andrew
>
>
> On Thu, Jan 23, 2014 at 10:26 PM, Ian O'Connell <ian@ianoconnell.com>wrote:
>
>> Is there any separation in the API between functions that can be built
>> solely on the existing exposed public API and ones which require access to
>> internals?
>>
>> Just to maybe avoid bloat for composite functions like this that are for
>> user convenience?
>>
>> (Ala something like lua's aux api vs core api?)
>>
>>
>> On Thu, Jan 23, 2014 at 8:33 PM, Matei Zaharia <matei.zaharia@gmail.com>wrote:
>>
>>> I’d be happy to see this added to the core API.
>>>
>>> Matei
>>>
>>> On Jan 23, 2014, at 5:39 PM, Andrew Ash <andrew@andrewash.com> wrote:
>>>
>>> Ah right of course -- perils of typing code without running it!
>>>
>>> It feels like this is a pretty core operation that should be added to
>>> the main RDD API.  Do other people not run into this often?
>>>
>>> When I'm validating a foreign key join in my cluster I often check to
>>> make sure that the foreign keys land on valid values on the referenced
>>> table, and the way I do that is checking to see what percentage of the
>>> references actually land.
>>>
>>>
>>> On Thu, Jan 23, 2014 at 6:36 PM, Evan R. Sparks <evan.sparks@gmail.com>wrote:
>>>
>>>> Yup (well, with _._1 at the end!)
>>>>
>>>>
>>>> On Thu, Jan 23, 2014 at 5:28 PM, Andrew Ash <andrew@andrewash.com>wrote:
>>>>
>>>>> You're thinking like this?
>>>>>
>>>>> A.map(v => (v,None)).join(B.map(v => (v,None))).map(_._2)
>>>>>
>>>>>
>>>>> On Thu, Jan 23, 2014 at 6:26 PM, Evan R. Sparks <evan.sparks@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> You could map each to an RDD[(String,None)] and do a join.
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 23, 2014 at 5:18 PM, Andrew Ash <andrew@andrewash.com>wrote:
>>>>>>
>>>>>>> Hi spark users,
>>>>>>>
>>>>>>> I recently wanted to calculate the set intersection of two RDDs
of
>>>>>>> Strings.  I couldn't find a .intersection() method in the autocomplete
or
>>>>>>> in the Scala API docs, so used a little set theory to end up
with this:
>>>>>>>
>>>>>>> lazy val A = ...
>>>>>>> lazy val B = ...
>>>>>>> A.union(B).subtract(A.subtract(B)).subtract(B.subtract(A))
>>>>>>>
>>>>>>> Which feels very cumbersome.
>>>>>>>
>>>>>>> Does anyone have a more idiomatic way to calculate intersection?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Andrew
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>

Mime
View raw message