spark-user mailing list archives

From Ewan Leith <ewan.le...@realitymine.com>
Subject Parquet file organisation for 100GB+ dataframes
Date Wed, 12 Aug 2015 10:28:12 GMT
Hi all,

Can anyone share their experiences storing and organising larger datasets with Spark?

I've got a dataframe stored as Parquet on Amazon S3 (via EMRFS) with a fairly complex
nested schema (derived from JSON files). I can query it in Spark, but the initial setup takes
a few minutes, as we've got roughly 5000 partitions and 150GB of compressed Parquet part files.
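
For context, the read side looks roughly like this (the bucket path and table name are
just illustrative; we're on Spark 1.4 via spark-shell):

  // Point Spark at the Parquet data on S3 (via EMRFS); the schema is the
  // nested one derived from our original JSON
  val df = sqlContext.read.parquet("s3://example-bucket/events/")
  df.registerTempTable("events")

  // Queries themselves run fine once the ~5000 part files have been discovered
  sqlContext.sql("SELECT count(*) FROM events").show()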

Generally things work, but we seem to be hitting various limitations now that we're working
with 100+GB of data: the 2GB block size limit in Spark, which forces us to use a large number
of partitions; slow startup due to partition discovery; and so on.
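
For what it's worth, the write side currently does something along these lines (the
partition count and path are illustrative), which is where the large number of part
files comes from:

  // Repartition before writing so each compressed part file stays comfortably
  // under the 2GB limit -- this is what leaves us with roughly 5000 part files
  df.repartition(5000)
    .write
    .parquet("s3://example-bucket/events-parquet/")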

Storing data in one big dataframe has worked well so far, but do we need to look at breaking
it out into multiple dataframes?

Has anyone got any advice on how to structure this?

Thanks,
Ewan

