spark-user mailing list archives

From Subhash Sriram <subhash.sri...@gmail.com>
Subject Re: Spark & S3 - Introducing random values into key names
Date Thu, 08 Mar 2018 17:19:23 GMT
Thanks, Vadim! That helps and makes sense. I don't think we have a number of keys so large
that we have to worry about it. If we do, I think I would go with an approach similar to what
you suggested.

Thanks again,
Subhash 

Sent from my iPhone

> On Mar 8, 2018, at 11:56 AM, Vadim Semenov <vadim@datadoghq.com> wrote:
> 
> You need to put the randomness at the beginning of the key; if you put it anywhere else, it's not guaranteed that you're going to get good performance.
> 
> The way we achieved this was to write to HDFS first, and then run a custom DistCp implemented in Spark that copies the parquet files to S3 under random keys and saves the list of resulting keys to S3 as well.
> When we want to use those parquet files, we just load the listing file, take the keys from it, and pass them to the loader.
> 
> You only need to do this when you have far too many files; if the number of keys you operate on is reasonably small (say, in the thousands), you won't get any benefit.
> 
> Also, S3 buckets have internal optimizations, and over time a bucket adjusts to the workload, i.e. additional underlying partitions get added, splits happen, etc.
> If you want good performance from the start, then yes, you would need to use randomization.
> Alternatively, you can contact AWS and tell them about the key-naming scheme you're going to use (but it must be set in stone), and they can try to pre-optimize the bucket for you.
> 
>> On Thu, Mar 8, 2018 at 11:42 AM, Subhash Sriram <subhash.sriram@gmail.com> wrote:
>> Hey Spark user community,
>> 
>> I am writing Parquet files from Spark to S3 using S3a. I was reading this article about improving S3 bucket performance, specifically about how introducing randomness into your key names can help, since the data then gets written to different partitions.
>> 
>> https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
>> 
>> Is there a straightforward way to accomplish this randomness in Spark via the DataSet API? The only thing I could think of would be to split the large set into multiple sets (based on row boundaries), and then write each one with a random key name.
>> 
>> Is there an easier way that I am missing?
>> 
>> Thanks in advance!
>> Subhash
>> 
>> 
> 
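
For reference, a minimal sketch of the staging approach Vadim describes: a previous job writes parquet to HDFS, a small Spark job copies the part files to S3 under random key prefixes, and the resulting keys are saved as a listing file. The bucket, paths, and object name here are hypothetical, and the executors are assumed to have the s3a configuration available on their classpath.

import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}
import org.apache.spark.sql.SparkSession

object RandomKeyCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("random-key-copy").getOrCreate()
    import spark.implicits._

    // Hypothetical locations -- substitute your own staging dir and bucket.
    val stagingDir  = "hdfs:///tmp/staging/my_table"
    val s3Bucket    = "s3a://my-bucket"
    val listingPath = s"$s3Bucket/listings/my_table"

    // 1. List the parquet part files that a previous Spark job wrote to HDFS.
    val srcFs = new Path(stagingDir).getFileSystem(spark.sparkContext.hadoopConfiguration)
    val partFiles = srcFs.listStatus(new Path(stagingDir))
      .map(_.getPath.toString)
      .filter(_.endsWith(".parquet"))
      .toSeq

    // 2. Copy each file to S3 under a random prefix so the keys differ at the
    //    *beginning* of the name. The copies run on the executors; this assumes
    //    they can build a Hadoop Configuration that includes the s3a settings.
    val copiedKeys = spark.sparkContext.parallelize(partFiles)
      .map { src =>
        val conf    = new Configuration()
        val srcPath = new Path(src)
        val dstPath = new Path(s"$s3Bucket/data/${UUID.randomUUID().toString.take(8)}/${srcPath.getName}")
        FileUtil.copy(srcPath.getFileSystem(conf), srcPath,
                      dstPath.getFileSystem(conf), dstPath,
                      false, conf)
        dstPath.toString
      }
      .collect()
      .toSeq

    // 3. Save the listing of resulting keys so readers never have to list the bucket.
    spark.createDataset(copiedKeys).coalesce(1).write.mode("overwrite").text(listingPath)

    // Later, a reader loads the listing and passes the keys straight to the parquet loader.
    val keys = spark.read.textFile(listingPath).collect()
    spark.read.parquet(keys: _*).show(5)

    spark.stop()
  }
}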

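And a rough sketch of the idea from Subhash's original question: splitting the DataSet into pieces and writing each piece under its own random key prefix directly via the DataSet API. The split count, bucket, and paths are again made up for illustration.

import java.util.UUID

import org.apache.spark.sql.{DataFrame, SparkSession}

object RandomPrefixWriter {
  // Splits the DataFrame into roughly equal pieces and writes each piece to a
  // path whose key starts with a random prefix. Returns the target paths so
  // they can be recorded in a listing file. Consider caching the input first,
  // since each split re-scans it on write.
  def writeWithRandomPrefixes(df: DataFrame, s3Prefix: String, numSplits: Int): Seq[String] = {
    df.randomSplit(Array.fill(numSplits)(1.0)).toSeq.map { piece =>
      val target = s"$s3Prefix/${UUID.randomUUID().toString.take(8)}"
      piece.write.parquet(target)
      target
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("random-prefix-write").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("hdfs:///tmp/input")   // hypothetical input
    val targets = writeWithRandomPrefixes(df, "s3a://my-bucket/data", numSplits = 16)

    // Record where the pieces went; readers can then do spark.read.parquet(targets: _*).
    spark.createDataset(targets).coalesce(1)
      .write.mode("overwrite").text("s3a://my-bucket/listings/data")

    spark.stop()
  }
}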