spark-user mailing list archives

From Sandip Mehta <sandip.mehta....@gmail.com>
Subject Row Encoder For DataSet
Date Fri, 08 Dec 2017 03:51:58 GMT
Hi,

During my aggregation I end up with the following schema:

Row(Row(val1,val2), Row(val1,val2,val3...))

val values = Seq(
    (Row(10, 11), Row(10, 2, 11)),
    (Row(10, 11), Row(10, 2, 11)),
    (Row(20, 11), Row(10, 2, 11))
  )


The first Row in each tuple is the key used to group the relevant records
for aggregation. I used the following to create the dataset:

val s = StructType(Seq(
  StructField("x", IntegerType, true),
  StructField("y", IntegerType, true)
))
val s1 = StructType(Seq(
  StructField("u", IntegerType, true),
  StructField("v", IntegerType, true),
  StructField("z", IntegerType, true)
))

val ds = sparkSession.sqlContext.createDataset(
  sparkSession.sparkContext.parallelize(values)
)(Encoders.tuple(RowEncoder(s), RowEncoder(s1)))

Is this the correct way to represent this?

How do I create the dataset and row encoders for this use case so that I
can call groupByKey on it?
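
To make the question concrete, here is a minimal sketch of what a groupByKey over such a dataset could look like. It is not a confirmed answer. It assumes Spark 2.x, where RowEncoder(schema) lives in org.apache.spark.sql.catalyst.encoders (later releases may expose row encoders through a different entry point). The object name GroupByRowKey, the helper sumsByKey, the names keySchema and valueSchema, the local master, and the choice to sum the third value field are all illustrative assumptions, not part of the code above.

```scala
import org.apache.spark.sql.{Encoders, Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical wrapper object; only the schemas and sample rows come from the message.
object GroupByRowKey {

  // Schema of the grouping key Row (x, y).
  val keySchema: StructType = StructType(Seq(
    StructField("x", IntegerType, true),
    StructField("y", IntegerType, true)
  ))

  // Schema of the value Row (u, v, z).
  val valueSchema: StructType = StructType(Seq(
    StructField("u", IntegerType, true),
    StructField("v", IntegerType, true),
    StructField("z", IntegerType, true)
  ))

  // Groups the sample rows by the key Row and sums field z of the value Row.
  def sumsByKey(spark: SparkSession): Array[(Row, Int)] = {
    val keyEncoder = RowEncoder(keySchema)
    val valueEncoder = RowEncoder(valueSchema)

    val values = Seq(
      (Row(10, 11), Row(10, 2, 11)),
      (Row(10, 11), Row(10, 2, 11)),
      (Row(20, 11), Row(10, 2, 11))
    )

    // Encoders are passed explicitly because no implicit Encoder[Row] exists.
    val ds = spark.createDataset(values)(Encoders.tuple(keyEncoder, valueEncoder))

    // groupByKey needs an Encoder for the key type, here the key Row encoder.
    val grouped = ds.groupByKey(_._1)(keyEncoder)

    // Illustrative aggregation: sum the third field (z) of every value Row in a group.
    val sums = grouped.mapGroups { (key, rows) =>
      (key, rows.map(_._2.getInt(2)).sum)
    }(Encoders.tuple(keyEncoder, Encoders.scalaInt))

    sums.collect()
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption made so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-encoder-sketch")
      .getOrCreate()
    try sumsByKey(spark).foreach(println)
    finally spark.stop()
  }
}
```

The key point in this sketch is that the same RowEncoder built from the key schema is reused both inside Encoders.tuple and as the explicit encoder for groupByKey, so the key columns are described consistently in both places.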



Regards
Sandeep
