spark-user mailing list archives

From macwanjason
Subject How to reshape RDD/Spark DataFrame
Date Sat, 16 May 2015 03:52:58 GMT
Hi all,

I am a student trying to learn Spark and I had a question regarding
converting rows to columns (data pivot/reshape). I have some data in the
following format (either RDD or Spark DataFrame):

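    from pyspark.sql import SQLContext
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    sqlContext = SQLContext(sc)

    # rows after the first are inferred from the desired output below
    rdd = sc.parallelize([('X01', 41, 'US', 3),
                          ('X01', 41, 'UK', 1),
                          ('X01', 41, 'CA', 2),
                          ('X02', 72, 'US', 4),
                          ('X02', 72, 'UK', 6),
                          ('X02', 72, 'CA', 7)])

    # convert to a Spark DataFrame
    schema = StructType([StructField('ID', StringType(), True),
                         StructField('Age', IntegerType(), True),
                         StructField('Country', StringType(), True),
                         StructField('Score', IntegerType(), True)])
    df = sqlContext.createDataFrame(rdd, schema)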

What I would like to do is 'reshape' the data, converting certain values of
Country (specifically US, UK, and CA) into columns:

    ID     Age  US  UK  CA
    'X01'  41   3   1   2
    'X02'  72   4   6   7

Essentially, I need something along the lines of pandas' `pivot` workflow:

    categories = ['US', 'UK', 'CA']
    new_df = df[df['Country'].isin(categories)].pivot(index = 'ID', 
                                                      columns = 'Country',
                                                      values = 'Score')

My dataset is rather large, so I can't really `collect()` the data into
memory and do the reshaping in Python itself. Is there a way to turn pandas'
`.pivot()` into a function that can be invoked while mapping over either an
RDD or a Spark DataFrame? Any help would be appreciated!
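
To make the question concrete, here is a rough sketch of what such invokable functions might look like for the RDD case. It is an illustration under assumptions, not a tested Spark solution: the helper names `to_pair`, `merge_scores`, and `to_wide` are made up for this example, the sample rows are taken from the desired output table above, and the Spark wiring appears only in a comment. The local loop just exercises the same logic without a cluster.

```python
CATEGORIES = ['US', 'UK', 'CA']  # countries that should become columns

def to_pair(row):
    """(ID, Age, Country, Score) -> ((ID, Age), {Country: Score})."""
    id_, age, country, score = row
    return ((id_, age), {country: score})

def merge_scores(a, b):
    """Combine two partial {Country: Score} dicts that share a key."""
    merged = dict(a)
    merged.update(b)
    return merged

def to_wide(pair):
    """((ID, Age), {Country: Score}) -> (ID, Age, US, UK, CA); None if absent."""
    (id_, age), scores = pair
    return (id_, age) + tuple(scores.get(c) for c in CATEGORIES)

# Assumed Spark wiring (not run here):
#   wide = (rdd.filter(lambda r: r[2] in CATEGORIES)
#              .map(to_pair)
#              .reduceByKey(merge_scores)
#              .map(to_wide))

# Local check of the same logic, no Spark required.
rows = [('X01', 41, 'US', 3), ('X01', 41, 'UK', 1), ('X01', 41, 'CA', 2),
        ('X02', 72, 'US', 4), ('X02', 72, 'UK', 6), ('X02', 72, 'CA', 7)]
grouped = {}
for key, scores in map(to_pair, (r for r in rows if r[2] in CATEGORIES)):
    grouped[key] = merge_scores(grouped.get(key, {}), scores)
wide_rows = sorted(to_wide(item) for item in grouped.items())
print(wide_rows)
```

Because each function only ever sees one record or one pair of partial results, nothing here would need the full dataset in memory on a single machine. If an ID had two scores for the same country, `merge_scores` would silently keep one of them, so that case would need its own rule.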

I initially posted this question on Stack Overflow, but the one suggested
solution is verbose, error-prone, and probably not scalable either.

Any help would be greatly appreciated. 
