spark-user mailing list archives

From Weichen Xu <weichen...@databricks.com>
Subject Re: Row Encoder For DataSet
Date Fri, 08 Dec 2017 04:59:40 GMT
You can group by multiple columns on a DataFrame, so why do you need such a
complicated schema?

Suppose the df schema is (x, y, u, v, z):

df.groupBy($"x", $"y").agg(...)

Is this what you want?
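
To make the suggestion concrete, here is a minimal sketch that flattens the sample values from the quoted message into five columns and groups on the first two. The aggregate functions (`sum`, `max`), the output column names, and the local `SparkSession` setup are assumptions chosen for illustration; the original message does not say which aggregation is needed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{max, sum}

object GroupByExample {
  // Groups on (x, y) and aggregates the remaining columns.
  // The choice of sum/max is an assumption; substitute the real aggregation.
  def aggregate(df: DataFrame): DataFrame =
    df.groupBy(df("x"), df("y"))
      .agg(sum(df("u")).as("sum_u"), sum(df("v")).as("sum_v"), max(df("z")).as("max_z"))

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupBy-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same sample values as in the quoted message, flattened into one row each.
    val df = Seq(
      (10, 11, 10, 2, 11),
      (10, 11, 10, 2, 11),
      (20, 11, 10, 2, 11)
    ).toDF("x", "y", "u", "v", "z")

    // Yields one row per distinct (x, y) pair.
    aggregate(df).show()

    spark.stop()
  }
}
```

With a flat schema like this, no custom encoder is needed, because the grouping keys are ordinary columns.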

On Fri, Dec 8, 2017 at 11:51 AM, Sandip Mehta <sandip.mehta.sub@gmail.com>
wrote:

> Hi,
>
> During my aggregation I end up with the following schema.
>
> Row(Row(val1,val2), Row(val1,val2,val3...))
>
> val values = Seq(
>     (Row(10, 11), Row(10, 2, 11)),
>     (Row(10, 11), Row(10, 2, 11)),
>     (Row(20, 11), Row(10, 2, 11))
>   )
>
>
> The first element of each tuple is used to group the relevant records for
> aggregation. I used the following to create the dataset.
>
> val s = StructType(Seq(
>   StructField("x", IntegerType, true),
>   StructField("y", IntegerType, true)
> ))
> val s1 = StructType(Seq(
>   StructField("u", IntegerType, true),
>   StructField("v", IntegerType, true),
>   StructField("z", IntegerType, true)
> ))
>
> val ds = sparkSession.sqlContext.createDataset(
>   sparkSession.sparkContext.parallelize(values)
> )(Encoders.tuple(RowEncoder(s), RowEncoder(s1)))
>
> Is this the correct way of representing this?
>
> How do I create a dataset and row encoder for such a use case, so that I
> can do a groupByKey on it?
>
>
>
> Regards
> Sandeep
>
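
If a typed `groupByKey` is still wanted, one possible sketch is to model the key and value parts as case classes instead of nested `Row`s, so that the encoders come from `spark.implicits._` rather than being built by hand. The names `GroupKey` and `Measures`, and the summing logic, are hypothetical and do not appear in the original message.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case classes standing in for the two nested Rows.
// They are declared at top level so Spark can derive encoders for them.
case class GroupKey(x: Int, y: Int)
case class Measures(u: Int, v: Int, z: Int)

object TypedGroupByKeyExample {
  // Groups by the first tuple element and sums each field of the second.
  // Summing is an assumption; the original message does not name the aggregation.
  def sumByKey(spark: SparkSession, ds: Dataset[(GroupKey, Measures)]): Dataset[(GroupKey, Measures)] = {
    import spark.implicits._
    ds.groupByKey(_._1).mapGroups { (key, rows) =>
      // Materialize the group so it can be traversed more than once.
      val ms = rows.map(_._2).toList
      (key, Measures(ms.map(_.u).sum, ms.map(_.v).sum, ms.map(_.z).sum))
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupByKey-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same sample values as in the message, expressed with case classes.
    val ds = Seq(
      (GroupKey(10, 11), Measures(10, 2, 11)),
      (GroupKey(10, 11), Measures(10, 2, 11)),
      (GroupKey(20, 11), Measures(10, 2, 11))
    ).toDS()

    sumByKey(spark, ds).show(false)

    spark.stop()
  }
}
```

Note that `mapGroups` materializes each group on a single executor, so for simple sums the untyped `groupBy(...).agg(...)` route is usually the lighter option.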
