spark-user mailing list archives

From Subhash Sriram <subhash.sri...@gmail.com>
Subject Spark & S3 - Introducing random values into key names
Date Thu, 08 Mar 2018 16:42:51 GMT
Hey Spark user community,

I am writing Parquet files from Spark to S3 using S3a. I was reading the
article below about improving S3 bucket performance, specifically the
suggestion to introduce randomness into key name prefixes so that objects
are spread across different S3 partitions.

https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/

Is there a straightforward way to accomplish this randomness in Spark via
the Dataset API? The only approach I could think of is to split the large
Dataset into multiple smaller ones (based on row boundaries), and then
write each one to a path with a random key prefix (see the sketch below).
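For what it's worth, here is a minimal sketch of that split-and-write idea
in Scala, assuming the Dataset/DataFrame API; the bucket name, paths, and
`df` are placeholders I made up, not anything from the article:

import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Random

val spark = SparkSession.builder().appName("random-prefix-write").getOrCreate()

// Hypothetical input; stands in for whatever Dataset you already have.
val df: DataFrame = spark.read.parquet("s3a://my-bucket/staging/")

// Cache first: randomSplit re-evaluates the input plan once per split otherwise.
df.cache()

// Split into N roughly equal pieces, then write each piece under a random
// hex prefix so the resulting keys land in different S3 partitions.
val numSplits = 8
val pieces = df.randomSplit(Array.fill(numSplits)(1.0))

pieces.foreach { piece =>
  // 2-character hex prefix, in the spirit of the AWS article's examples.
  val prefix = f"${Random.nextInt(256)}%02x"
  piece.write.mode("append").parquet(s"s3a://my-bucket/$prefix/events/")
}

One caveat with this approach: besides the extra jobs, a random prefix per
split means there is no single predictable output path, so readers would
have to list the bucket or track the generated prefixes somewhere.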

Is there an easier way that I am missing?

Thanks in advance!
Subhash
