spark-user mailing list archives

From Edward Sargisson <ejsa...@gmail.com>
Subject Using groupByKey with Spark SQL
Date Fri, 15 May 2015 17:48:25 GMT
Hi all,
Depending on the answer, this might be a question or feedback for a possible new feature:

We have source data consisting of events about the state changes of an entity
(identified by an ID), represented as nested JSON.
We wanted to sessionize this data so that we had a collection of all the
events for a given ID, since we have to do further processing based on what
we find.

We tried doing this using Spark SQL and then converting to a JavaPairRDD
via DataFrame.javaRDD() followed by groupByKey.

The schema inference worked great, but the frustrating part was that the
result of groupByKey is JavaPairRDD<String, Iterable<Row>>. Those Rows only
offer positional get(int) accessors and no longer carry the inferred schema,
so they ended up being something we didn't want to work with.
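For context, here is a minimal sketch of the approach described above, assuming the Spark 1.3 Java API; the class name, input path, and the assumption that the ID is the first column are all invented for illustration:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import scala.Tuple2;

public class Sessionize {
    // sqlContext and the input path are placeholders.
    public static JavaPairRDD<String, Iterable<Row>> sessionize(SQLContext sqlContext) {
        // Schema inference from nested JSON works well here.
        DataFrame events = sqlContext.jsonFile("events.json");

        // Group all events by entity ID (assumed to be column 0).
        // After this step each Row is only accessible positionally via
        // get(int)/getString(int); the inferred schema is effectively lost.
        return events.javaRDD()
            .mapToPair(row -> new Tuple2<>(row.getString(0), row))
            .groupByKey();
    }
}
```

This compiles against Spark 1.3 but needs a running SparkContext/SQLContext, so it is a sketch rather than a standalone program.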

We are currently solving this problem by ignoring Spark SQL and
deserializing the event JSON into a POJO for further processing.
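Roughly, the workaround looks like the following sketch, assuming Jackson for deserialization of one JSON document per line; the Event POJO and its field names are invented for illustration:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class PojoSessionize {
    // Invented POJO; the real fields depend on the actual event schema.
    public static class Event implements java.io.Serializable {
        public String entityId;
        public String state;
    }

    public static JavaPairRDD<String, Iterable<Event>> sessionize(JavaRDD<String> jsonLines) {
        return jsonLines
            // Deserialize each JSON line into a typed Event.
            // (Creating an ObjectMapper per record is wasteful; a
            // mapPartitions variant would reuse one per partition.)
            .map(line -> new ObjectMapper().readValue(line, Event.class))
            // Key by entity ID and collect each entity's events.
            .mapToPair(event -> new Tuple2<>(event.entityId, event))
            .groupByKey();
    }
}
```

The upside is typed field access downstream; the downside is giving up Spark SQL's schema inference entirely.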

Are there better approaches to this?
Perhaps Spark should have a DataFrame.groupByKey that returns Rows which
still carry the schema?

Thanks!
Edward
