spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Armbrust <>
Subject Re: Dataset Outer Join vs RDD Outer Join
Date Wed, 01 Jun 2016 17:42:24 GMT
Thanks for the feedback.  I think this will address at least some of the
problems you are describing:

On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <>

> Hi,
> I've been working on transitioning from RDD to Datasets in our codebase in
> anticipation of being able to leverage features of 2.0.
> I'm having a lot of difficulties with the impedance mismatches between how
> outer joins worked with RDD versus Dataset. The Dataset joins feel like a
> big step backwards IMO. With RDD, leftOuterJoin would give you Option types
> of the results from the right side of the join. This follows idiomatic
> Scala avoiding nulls and was easy to work with.
> Now with Dataset there is only joinWith where you specify the join type,
> but it lost all the semantics of identifying missing data from outer joins.
> I can write some enriched methods on Dataset with an implicit class to
> abstract messiness away if Dataset nulled out all mismatching data from an
> outer join, however the problem goes even further in that the values aren't
> always null. Integer, for example, defaults to -1 instead of null. Now it's
> completely ambiguous what data in the join was actually there versus
> populated via this atypical semantic.
> Are there additional options available to work around this issue? I can
> convert to RDD and back to Dataset but that's less than ideal.
> Thanks,
> --
> *Richard Marscher*
> Senior Software Engineer
> Localytics
> <> | Our Blog
> <> | Twitter <> |
> Facebook <> | LinkedIn
> <>

View raw message