I've been working on transitioning from RDD to Datasets in our codebase in anticipation of being able to leverage features of 2.0.

I'm having a lot of difficulties with the impedance mismatches between how outer joins worked with RDD versus Dataset. The Dataset joins feel like a big step backwards IMO. With RDD, leftOuterJoin would give you Option types of the results from the right side of the join. This follows idiomatic Scala avoiding nulls and was easy to work with.

Now with Dataset there is only joinWith where you specify the join type, but it lost all the semantics of identifying missing data from outer joins. I can write some enriched methods on Dataset with an implicit class to abstract messiness away if Dataset nulled out all mismatching data from an outer join, however the problem goes even further in that the values aren't always null. Integer, for example, defaults to -1 instead of null. Now it's completely ambiguous what data in the join was actually there versus populated via this atypical semantic.

Are there additional options available to work around this issue? I can convert to RDD and back to Dataset but that's less than ideal.

Richard Marscher
Senior Software Engineer
Localytics.com | Our Blog | Twitter | Facebook | LinkedIn