spark-user mailing list archives

From Evan Sparks <evan.spa...@gmail.com>
Subject Re: .intersection() method on RDDs?
Date Fri, 24 Jan 2014 06:50:13 GMT
If the intersection is really big, would join be better?

Agreed on "null" vs None -but how frequent is this in the current codebase?

> On Jan 23, 2014, at 10:38 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
> 
> You’d have to add a filter after the cogroup too. Cogroup gives you (key, (list of values in RDD 1, list in RDD 2)).
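> 
> As a rough, untested sketch (assuming A and B are RDDs with the same element type):
> 
> A.map(v => (v, null)).cogroup(B.map(v => (v, null)))
>   .filter { case (_, (as, bs)) => as.nonEmpty && bs.nonEmpty }
>   .keys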
> 
> Also one small thing, instead of setting the value to None, it may be cheaper to use null.
> 
> Matei
> 
>> On Jan 23, 2014, at 10:30 PM, Andrew Ash <andrew@andrewash.com> wrote:
>> 
>> You mean cogroup like this?
>> 
>> A.map(v => (v,None)).cogroup(B.map(v => (v,None))).keys
>> 
>> If so I might send a PR to start code review for getting this into master.
>> 
>> Good to know about the strategy for sharding RDDs and for the core operations.
>> 
>> Thanks!
>> Andrew
>> 
>> 
>>> On Thu, Jan 23, 2014 at 11:17 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
>>> Using cogroup would probably be slightly more efficient than join because you don’t have to generate every pair of keys for elements that occur in each dataset multiple times.
>>> 
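>>> As a concrete example: if some value x occurs 3 times in A and 2 times in B, the join-based approach materializes 3 * 2 = 6 (x, (None, None)) pairs before the final map, whereas cogroup produces a single (x, (Seq(None, None, None), Seq(None, None))) record.
>>> 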
>>> We haven’t tried to explicitly separate the API between “core” methods and others, but in practice, everything can be built on mapPartitions and cogroup for transformations, and SparkContext.runJob (an internal method) for actions. What really matters is the level at which the code sees dependencies in the DAGScheduler, which is done through the Dependency class. There are only two types of dependencies (narrow and shuffle), which correspond to those operations above. So in a sense there is this separation at the lowest level. But for the levels above, the goal was first and foremost to make the API as usable as possible, which meant giving people quick access to all the operations that might be useful, and dealing with how we’ll implement those later. Over time it will be possible to divide things like RDD.scala into multiple traits if they become unwieldy.
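>>> 
>>> For instance, as an illustrative sketch (not literally how map is implemented, just how it can be expressed), a narrow transformation like map reduces to mapPartitions:
>>> 
>>> A.map(f)                               // is equivalent to
>>> A.mapPartitions(iter => iter.map(f))   // a narrow, per-partition dependency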
>>> 
>>> Matei
>>> 
>>> 
>>>> On Jan 23, 2014, at 9:40 PM, Andrew Ash <andrew@andrewash.com> wrote:
>>>> 
>>>> And I think the follow-up to Ian's question:
>>>> 
>>>> Is there a way to implement .intersect() in the core API that's more efficient than the .join() method Evan suggested?
>>>> 
>>>> Andrew
>>>> 
>>>> 
>>>>> On Thu, Jan 23, 2014 at 10:26 PM, Ian O'Connell <ian@ianoconnell.com> wrote:
>>>>> Is there any separation in the API between functions that can be built solely on the existing exposed public API and ones which require access to internals?
>>>>> 
>>>>> Just to maybe avoid bloat for composite functions like this that are for user convenience?
>>>>> 
>>>>> (A la something like Lua's aux API vs. core API?)
>>>>> 
>>>>> 
>>>>>> On Thu, Jan 23, 2014 at 8:33 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
>>>>>> I’d be happy to see this added to the core API.
>>>>>> 
>>>>>> Matei
>>>>>> 
>>>>>>> On Jan 23, 2014, at 5:39 PM, Andrew Ash <andrew@andrewash.com> wrote:
>>>>>>> 
>>>>>>> Ah right of course -- perils of typing code without running it!
>>>>>>> 
>>>>>>> It feels like this is a pretty core operation that should be added to the main RDD API. Do other people not run into this often?
>>>>>>> 
>>>>>>> When I'm validating a foreign key join in my cluster, I often check to make sure that the foreign keys land on valid values in the referenced table, and I do that by checking what percentage of the references actually land.
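>>>>>>> 
>>>>>>> Roughly like this, as a sketch (fks and pks are hypothetical RDD[String]s of foreign keys and distinct referenced keys):
>>>>>>> 
>>>>>>> val landed = fks.map(k => (k, None)).join(pks.map(k => (k, None))).count
>>>>>>> val pctLanded = 100.0 * landed / fks.count  // assumes pks has no duplicates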
>>>>>>> 
>>>>>>> 
>>>>>>>> On Thu, Jan 23, 2014 at 6:36 PM, Evan R. Sparks <evan.sparks@gmail.com> wrote:
>>>>>>>> Yup (well, with _._1 at the end!)
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Thu, Jan 23, 2014 at 5:28 PM, Andrew Ash <andrew@andrewash.com> wrote:
>>>>>>>>> You're thinking like this?
>>>>>>>>> 
>>>>>>>>> A.map(v => (v,None)).join(B.map(v => (v,None))).map(_._2)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Thu, Jan 23, 2014 at 6:26 PM, Evan R. Sparks <evan.sparks@gmail.com> wrote:
>>>>>>>>>> You could map each to an RDD[(String,None)] and do a join.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Thu, Jan 23, 2014 at 5:18 PM, Andrew Ash <andrew@andrewash.com> wrote:
>>>>>>>>>>> Hi spark users,
>>>>>>>>>>> 
>>>>>>>>>>> I recently wanted to calculate the set intersection of two RDDs of Strings. I couldn't find a .intersection() method in the autocomplete or in the Scala API docs, so I used a little set theory to end up with this:
>>>>>>>>>>> 
>>>>>>>>>>> lazy val A = ...
>>>>>>>>>>> lazy val B = ...
>>>>>>>>>>> A.union(B).subtract(A.subtract(B)).subtract(B.subtract(A))
>>>>>>>>>>> 
>>>>>>>>>>> Which feels very cumbersome.
>>>>>>>>>>> 
>>>>>>>>>>> Does anyone have a more idiomatic way to calculate the intersection?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks!
>>>>>>>>>>> Andrew
> 
