spark-user mailing list archives

From Richard Marscher <>
Subject Re: Dataset Outer Join vs RDD Outer Join
Date Wed, 01 Jun 2016 18:59:28 GMT
Ah thanks, I missed seeing the PR. If the rows become null
objects, then I can implement methods that map those back to results
that align more closely with the RDD interface.

As a follow-on, I'm curious about thoughts on enriching the Dataset
join interface versus leaving a package, or users themselves, to add the
sugar. I haven't considered the implications of what the Dataset
optimizations, Tungsten, and/or bytecode generation can do now regarding
joins, so I may be missing a critical benefit there, say around avoiding
Options in favor of nulls. If nothing else, Option doesn't have a
first-class Encoder or DataType yet, and maybe for good reasons.

I did find the RDD join interface elegant, though. In the ideal world an
API comparable to the following would be nice:
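A sketch of that Option-returning shape, modeled on plain Scala collections rather than Spark (the `leftOuterJoin` helper and its signature are assumptions for illustration, not Spark API):

```scala
// Illustrative left outer join over Seq[(K, V)] pairs: every left row is
// kept, and the right side is an Option that is None when no key matches.
object OuterJoinSketch {
  def leftOuterJoin[K, A, B](
      left: Seq[(K, A)],
      right: Seq[(K, B)]): Seq[(K, (A, Option[B]))] = {
    // index the right side by key
    val rightMap: Map[K, Seq[B]] =
      right.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    left.flatMap { case (k, a) =>
      rightMap.get(k) match {
        case Some(bs) => bs.map(b => (k, (a, Some(b)))) // matched rows
        case None     => Seq((k, (a, None)))            // missed: explicit None
      }
    }
  }
}
```

The point of the shape is that missing data is encoded in the type (`Option[B]`) rather than by nulls or sentinel defaults.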

On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <>

> Thanks for the feedback.  I think this will address at least some of the
> problems you are describing:
> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <
> > wrote:
>> Hi,
>> I've been working on transitioning from RDD to Datasets in our codebase
>> in anticipation of being able to leverage features of 2.0.
>> I'm having a lot of difficulties with the impedance mismatches between
>> how outer joins worked with RDD versus Dataset. The Dataset joins feel like
>> a big step backwards IMO. With RDD, leftOuterJoin would give you Option
>> types of the results from the right side of the join. This follows
>> idiomatic Scala avoiding nulls and was easy to work with.
>> Now with Dataset there is only joinWith where you specify the join type,
>> but it lost all the semantics of identifying missing data from outer joins.
>> I can write some enriched methods on Dataset with an implicit class to
>> abstract the messiness away if Dataset nulled out all mismatching data from
>> an outer join; however, the problem goes even further in that the values
>> aren't always null. Integer, for example, defaults to -1 instead of null.
>> Now it's completely ambiguous what data in the join was actually there
>> versus populated via this atypical semantic.
>> Are there additional options available to work around this issue? I can
>> convert to RDD and back to Dataset but that's less than ideal.
>> Thanks,
>> --
>> *Richard Marscher*
>> Senior Software Engineer
>> Localytics

*Richard Marscher*
Senior Software Engineer
Localytics
