spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sandip Mehta <sandip.mehta....@gmail.com>
Subject Re: Row Encoder For DataSet
Date Fri, 08 Dec 2017 05:19:59 GMT
Hi,

I want to group on certain columns and then for every group wants to apply
custom UDF function to it. Currently groupBy only allows to add aggregation
function to GroupData.

For this was thinking to use groupByKey which will return KeyValueDataSet
and then apply UDF for every group but really not been able solve this.

SM

On Fri, Dec 8, 2017 at 10:29 AM Weichen Xu <weichen.xu@databricks.com>
wrote:

> You can groupBy multiple columns on dataframe, so why you need so
> complicated schema ?
>
> suppose df schema: (x, y, u, v, z)
>
> df.groupBy($"x", $"y").agg(...)
>
> Is this you want ?
>
> On Fri, Dec 8, 2017 at 11:51 AM, Sandip Mehta <sandip.mehta.sub@gmail.com>
> wrote:
>
>> Hi,
>>
>> During my aggregation I end up having following schema.
>>
>> Row(Row(val1,val2), Row(val1,val2,val3...))
>>
>> val values = Seq(
>>     (Row(10, 11), Row(10, 2, 11)),
>>     (Row(10, 11), Row(10, 2, 11)),
>>     (Row(20, 11), Row(10, 2, 11))
>>   )
>>
>>
>> 1st tuple is used to group the relevant records for aggregation. I have
>> used following to create dataset.
>>
>> val s = StructType(Seq(
>>   StructField("x", IntegerType, true),
>>   StructField("y", IntegerType, true)
>> ))
>> val s1 = StructType(Seq(
>>   StructField("u", IntegerType, true),
>>   StructField("v", IntegerType, true),
>>   StructField("z", IntegerType, true)
>> ))
>>
>> val ds = sparkSession.sqlContext.createDataset(sparkSession.sparkContext.parallelize(values))(Encoders.tuple(RowEncoder(s),
RowEncoder(s1)))
>>
>> Is this correct way of representing this?
>>
>> How do I create dataset and row encoder for such use case for doing
>> groupByKey on this?
>>
>>
>>
>> Regards
>> Sandeep
>>
>
>

Mime
View raw message