spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Henrique Oliveira <heso...@gmail.com>
Subject [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.
Date Sun, 02 Aug 2020 00:07:09 GMT
I have a PySpark method that applies the explode function on every Array
column on the DataFrame.

def explode_column(df, column):
    select_cols = list(df.columns)
    col_position = select_cols.index(column)
    select_cols[col_position] = explode_outer(column).alias(column)
    return df.select(select_cols)

def explode_all_arrays(df):
    still_has_arrays = True
    exploded_df = df

    while still_has_arrays:
        still_has_arrays = False
        for f in exploded_df.schema.fields:
            if isinstance(f.dataType, ArrayType):
                print(f"Exploding: {f}")
                still_has_arrays = True
                exploded_df = explode_column(exploded_df, f.name)

    return exploded_df

When I have a small number of columns to explode it works perfectly,
but on large DataFrames (~200 columns with ~40 explosions), after
finishing, the DataFrame can't be written as Parquet.

Even a small amount of data fails (400KB), not during the method
processing, but on the write step.

Any tips? I tried to write the dataframe as table and as a parquet
file. It works when I store it as a temp view though.

Thank you,

henrique.

Mime
View raw message