spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christos Kozanitis <kozani...@berkeley.edu>
Subject Re: SparkSQL extensions
Date Sun, 27 Jul 2014 08:07:19 GMT
Thanks Michael for the recommendations. Actually the region-join (or I
could name it range-join or interval-join) that I was thinking should join
the entries of two tables with inequality predicates. For example if table
A(col1 int, col2 int) contains entries (1,4) and (10,12) and table b(c1
int, c2 int) contains entries (3,6) and (43,23) then the region-join of A,
B on (col1 < c1 and c2 < col2) should produce the tuple(1,4,3,6).

Does it make sense?

Actually there is a JIRA on a similar topic for Hive here:
https://issues.apache.org/jira/browse/HIVE-556

Also ADAM implements region-joins here:
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/RegionJoin.scala

I was thinking to provide an improved version of method "partitionAndJoin"
from the ADAM implementation above



On Sat, Jul 26, 2014 at 12:37 PM, Michael Armbrust <michael@databricks.com>
wrote:

> A very simple example of adding a new operator to Spark SQL:
> https://github.com/apache/spark/pull/1366
> An example of adding a new type of join to Spark SQL:
> https://github.com/apache/spark/pull/837
>
> Basically, you will need to add a new physical operator that inherits from
> SparkPlan and a Strategy that causes the query planner to select it.  Maybe
> you can explain a little more what you mean by region-join?  If its only a
> different algorithm, and not a logically different type of join, then you
> will not need to make some of he logical modifications that the second PR
> did.
>
> Often the hardest part here is going to be figuring out when to use one
> join over another.  Right now the rules are pretty straightforward: The
> joins that are picked first are the most efficient but only handle certain
> cases (inner joins with equality predicates).  When that is not the case it
> falls back on slower, but more general operators.  If there are more subtle
> trade offs involved then we may need to wait until we have more statistics
> to help us make the choice.
>
> I'd suggest opening a JIRA and proposing a design before going too far.
>
> Michael
>
>
> On Sat, Jul 26, 2014 at 3:32 AM, Christos Kozanitis <
> kozanitis@berkeley.edu> wrote:
>
>> Hello
>>
>> I was wondering is it easy for you guys to point me to what modules I
>> need to update if I had to add extra functionality to sparkSQL?
>>
>> I was thinking to implement a region-join operator and I guess I should
>> add the implementation details under joins.scala but what else do I need to
>> modify?
>>
>> thanks
>> Christos
>>
>
>

Mime
View raw message