spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Richard Marscher <>
Subject Dataset Outer Join vs RDD Outer Join
Date Wed, 01 Jun 2016 16:58:08 GMT

I've been working on transitioning from RDD to Datasets in our codebase in
anticipation of being able to leverage features of 2.0.

I'm having a lot of difficulties with the impedance mismatches between how
outer joins worked with RDD versus Dataset. The Dataset joins feel like a
big step backwards IMO. With RDD, leftOuterJoin would give you Option types
of the results from the right side of the join. This follows idiomatic
Scala avoiding nulls and was easy to work with.

Now with Dataset there is only joinWith where you specify the join type,
but it lost all the semantics of identifying missing data from outer joins.
I can write some enriched methods on Dataset with an implicit class to
abstract messiness away if Dataset nulled out all mismatching data from an
outer join, however the problem goes even further in that the values aren't
always null. Integer, for example, defaults to -1 instead of null. Now it's
completely ambiguous what data in the join was actually there versus
populated via this atypical semantic.

Are there additional options available to work around this issue? I can
convert to RDD and back to Dataset but that's less than ideal.

*Richard Marscher*
Senior Software Engineer
Localytics <> | Our Blog
<> | Twitter <> |
Facebook <> | LinkedIn

View raw message