spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Richard Marscher <rmarsc...@localytics.com>
Subject Re: Dataset Outer Join vs RDD Outer Join
Date Wed, 01 Jun 2016 18:59:28 GMT
Ah thanks, I missed seeing the PR for
https://issues.apache.org/jira/browse/SPARK-15441. If the rows became null
objects then I can implement methods that will map those back to results
that align closer to the RDD interface.

As a follow on, I'm curious about thoughts regarding enriching the Dataset
join interface versus a package or users sugaring for themselves. I haven't
considered the implications of what the optimizations datasets, tungsten,
and/or bytecode gen can do now regarding joins so I may be missing a
critical benefit there around say avoiding Options in favor of nulls. If
nothing else, I guess Option doesn't have a first class Encoder or DataType
yet and maybe for good reasons.

I did find the RDD join interface elegant, though. In the ideal world an
API comparable the following would be nice:
https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06


On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <michael@databricks.com>
wrote:

> Thanks for the feedback.  I think this will address at least some of the
> problems you are describing: https://github.com/apache/spark/pull/13425
>
> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rmarscher@localytics.com
> > wrote:
>
>> Hi,
>>
>> I've been working on transitioning from RDD to Datasets in our codebase
>> in anticipation of being able to leverage features of 2.0.
>>
>> I'm having a lot of difficulties with the impedance mismatches between
>> how outer joins worked with RDD versus Dataset. The Dataset joins feel like
>> a big step backwards IMO. With RDD, leftOuterJoin would give you Option
>> types of the results from the right side of the join. This follows
>> idiomatic Scala avoiding nulls and was easy to work with.
>>
>> Now with Dataset there is only joinWith where you specify the join type,
>> but it lost all the semantics of identifying missing data from outer joins.
>> I can write some enriched methods on Dataset with an implicit class to
>> abstract messiness away if Dataset nulled out all mismatching data from an
>> outer join, however the problem goes even further in that the values aren't
>> always null. Integer, for example, defaults to -1 instead of null. Now it's
>> completely ambiguous what data in the join was actually there versus
>> populated via this atypical semantic.
>>
>> Are there additional options available to work around this issue? I can
>> convert to RDD and back to Dataset but that's less than ideal.
>>
>> Thanks,
>> --
>> *Richard Marscher*
>> Senior Software Engineer
>> Localytics
>> Localytics.com <http://localytics.com/> | Our Blog
>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics>
|
>> Facebook <http://facebook.com/localytics> | LinkedIn
>> <http://www.linkedin.com/company/1148792?trk=tyah>
>>
>
>


-- 
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com <http://localytics.com/> | Our Blog
<http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
Facebook <http://facebook.com/localytics> | LinkedIn
<http://www.linkedin.com/company/1148792?trk=tyah>

Mime
View raw message