spark-user mailing list archives

From Ewan Leith <ewan.le...@realitymine.com>
Subject RE: Selecting different levels of nested data records during one select?
Date Thu, 27 Aug 2015 10:52:45 GMT
I've just come across https://forums.databricks.com/questions/893/how-do-i-explode-a-dataframe-column-containing-a-c.html

Which appears to get us started using explode correctly on nested datasets stored as arrays, thanks.
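For the archive: the effect that post describes — exploding the array once, so that each element becomes its own row carrying the parent userid — can be sketched in plain Python against the schema from the original question (the sample data below is hypothetical, and no Spark is required for the illustration):

```python
# Plain-Python sketch of the flattening that a single explode() of the
# "datarecords" array performs: one output row per array element, each
# carrying its parent userid. Sample data is hypothetical.
rows = [
    {"userid": "u1", "datarecords": [
        {"name": "a", "system": True,  "time": "t1", "title": "A"},
        {"name": "b", "system": False, "time": "t2", "title": "B"},
    ]},
    {"userid": "u2", "datarecords": [
        {"name": "c", "system": True, "time": "t3", "title": "C"},
    ]},
]

# Flatten: pair the userid with every element of its datarecords array.
flat = [
    {"userid": row["userid"], **rec}
    for row in rows
    for rec in row["datarecords"]
]

for r in flat:
    print(r["userid"], r["name"], r["system"], r["time"], r["title"])
```

In DataFrame terms this corresponds to exploding the array column once into a single struct column and then selecting the struct's fields, rather than exploding each nested field separately.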

Ewan

From: Ewan Leith [mailto:ewan.leith@realitymine.com]
Sent: 27 August 2015 10:09
To: user@spark.apache.org
Subject: Selecting different levels of nested data records during one select?

Hello,

I'm trying to query a nested data record of the form:

root
|-- userid: string (nullable = true)
|-- datarecords: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- name: string (nullable = true)
|    |    |-- system: boolean (nullable = true)
|    |    |-- time: string (nullable = true)
|    |    |-- title: string (nullable = true)

Where for each "userid" record, there are many "datarecords" elements.

I'd like to be able to run the SQL equivalent of:

"select userid, name, system, time, title"

and get one output row per nested element, each containing the matching userid for that row
(if that makes sense!).

The "explode" function seemed like the place to start, but it seems I have to call it individually
for each nested column, and I then end up with a huge number of results from what looks like a Cartesian join.
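The blow-up from per-column explodes can be illustrated without Spark: exploding two nested fields separately and pairing the results back together by userid alone yields the cross product of the array elements, not one row per element (the data below is hypothetical):

```python
# Hypothetical result of exploding datarecords.name and datarecords.time
# separately for a user with a two-element array:
names = [("u1", "a"), ("u1", "b")]    # (userid, name) pairs
times = [("u1", "t1"), ("u1", "t2")]  # (userid, time) pairs

# Recombining the two exploded columns on userid alone gives every
# name paired with every time: a Cartesian join.
joined = [(un, n, t) for (un, n) in names for (ut, t) in times if un == ut]

print(len(joined))  # 4 rows from a 2-element array
```

Exploding the whole struct once instead keeps name and time together in each output row, avoiding the cross product.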

Is anyone able to point me in the right direction?

Thanks,
Ewan


