spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Koert Kuipers <>
Subject Re: frustration with field names in Dataset
Date Thu, 02 Feb 2017 15:41:58 GMT
another example is if i have a Dataset[(K, V)] and i want to repartition it
by the key K. repartition requires a Column which means i am suddenly back
to worrying about duplicate field names. i would like to be able to say:


On Thu, Feb 2, 2017 at 10:39 AM, Koert Kuipers <> wrote:

> since a dataset is a typed object you ideally don't have to think about
> field names.
> however there are operations on Dataset that require you to provide a
> Column, like for example joinWith (and joinWith returns a strongly typed
> Dataset, not DataFrame). once you have to provide a Column you are back to
> thinking in field names, and worrying about duplicate field names, which is
> something that can easily happen in a Dataset without you realizing it.
> so under the hood Dataset has unique identifiers for every column, as in
> dataset.queryExecution.logical.output, but these are expressions
> (attributes) that i cannot turn back into columns since the constructors
> for this are private in spark.
> so.... how about having Dataset.apply(i: Int): Column to allow me to pick
> columns by position without having to worry about (duplicate) field names?
> then i could do something like:
> dataset.joinWith(otherDataset, dataset(0) === otherDataset(0), joinType)

View raw message