spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michał Zieliński <zielinski.mich...@gmail.com>
Subject Re: Flattening Data within DataFrames
Date Mon, 29 Feb 2016 17:40:06 GMT
Hi Kevin,

This should help:
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-spark.html

On 29 February 2016 at 16:54, Kevin Mellott <kevin.r.mellott@gmail.com>
wrote:

> Fellow Sparkers,
>
> I'm trying to "flatten" my view of data within a DataFrame, and am having
> difficulties doing so. The DataFrame contains product information, which
> includes multiple levels of categories (primary, secondary, etc).
>
> *Example Data (Raw):*
> *Name                    Level            Category*
> Baked Code            Food             1
> Baked Code            Seafood         2
> Baked Code            Fish               3
> Hockey Stick          Sports            1
> Hockey Stick          Hockey          2
> Hockey Stick          Equipment      3
>
> *Desired Data:*
> *Name                    Category1     Category2     Category3*
> Baked Code            Food              Seafood         Fish
> Hockey Stick          Sports            Hockey          Equipment
>
> *Approach:*
> After parsing the "raw" information into two separate DataFrames (called *products
> *and *categories*) and registering them as a Spark SQL tables, I was
> attempting to perform the following query to flatten this all into the
> "desired data" (depicted above).
>
> products.registerTempTable("products")
> categories.registerTempTable("categories")
>
> val productList = sqlContext.sql(
>   " SELECT p.Name, " +
>   " c1.Description AS Category1, " +
>   " c2.Description AS Category2, " +
>   " c3.Description AS Category3 " +
>   " FROM products AS p " +
>   "   JOIN categories AS c1 " +
>   "     ON c1.Name = p.Name AND c1.Level = '1' "
>   "   JOIN categories AS c2 " +
>   "     ON c2.Name = p.Name AND c2.Level = '2' "
>   "   JOIN categories AS c3 " +
>   "     ON c3.Name = p.Name AND c3.Level = '3' "
>
> *Issue:*
> I get an error when running my query above, because I am not able to JOIN
> the *categories* table more than once. Has anybody dealt with this type
> of use case before, and if so how did you achieve the desired behavior?
>
> Thank you in advance for your thoughts.
>
> Kevin
>

Mime
View raw message