spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Armbrust <mich...@databricks.com>
Subject Re: feedback on dataset api explode
Date Wed, 25 May 2016 18:55:13 GMT
These APIs predate Datasets / encoders, so that is why they are Row instead
of objects.  We should probably rethink that.

Honestly, I usually end up using the column expression version of explode
now that it exists (i.e. explode($"arrayCol").as("Item")).  It would be
great to understand more why you are using these instead.

On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers <koert@tresata.com> wrote:

> we currently have 2 explode definitions in Dataset:
>
>  def explode[A <: Product : TypeTag](input: Column*)(f: Row =>
> TraversableOnce[A]): DataFrame
>
>  def explode[A, B : TypeTag](inputColumn: String, outputColumn: String)(f:
> A => TraversableOnce[B]): DataFrame
>
> 1) the separation of the functions into their own argument lists is nice,
> but unfortunately scala's type inference doesn't handle this well, meaning
> that the generic types always have to be explicitly provided. i assume this
> was done to allow the "input" to be a varargs in the first method, and then
> kept the same in the second for reasons of symmetry.
>
> 2) i am surprised the first definition returns a DataFrame. this seems to
> suggest DataFrame usage (so DataFrame to DataFrame), but there is no way to
> specify the output column names, which limits its usability for DataFrames.
> i frequently end up using the first definition for DataFrames anyhow
> because of the need to return more than 1 column (and the data has columns
> unknown at compile time that i need to carry along making flatMap on
> Dataset clumsy/unusable), but relying on the output columns being called _1
> and _2 and renaming then afterwards seems like an anti-pattern.
>
> 3) using Row objects isn't very pretty. why not f: A => TraversableOnce[B]
> or something like that for the first definition? how about:
>  def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output:
> Seq[Column])(f: A => TraversableOnce[B]): DataFrame
>
> best,
> koert
>

Mime
View raw message