Hi all,

 

Can anyone share their experiences working with storing and organising larger datasets with Spark?

 

I’ve got a dataframe stored in Parquet on Amazon S3 (using EMRFS) which has a fairly complex nested schema (based on JSON files), which I can query in Spark, but the initial setup takes a few minutes, as we’ve got roughly 5000 partitions and 150GB of compressed parquet part files.

 

Generally things work, but we seem to be hitting various limitations now we’re working with 100+GB of data, such as the 2GB block size limit in Spark which means we need a large number of partitions, slow startup due to partition discovery, etc.

 

Storing data in one big dataframe has worked well so far, but do we need to look at breaking it out into multiple dataframes?

 

Has anyone got any advice on how to structure this?

 

Thanks,

Ewan