spark-user mailing list archives

From Hariharan <hariharan...@gmail.com>
Subject Re: Spark performance over S3
Date Wed, 07 Apr 2021 11:12:44 GMT
Hi Tzahi,

Comparing the first two cases:
- > reads the parquet files from S3 and also writes to S3, it takes 22 min
- > reads the parquet files from S3 and writes to its local hdfs, it takes
the same amount of time (±22 min)

It looks like most of the time is being spent in reading, and the time
spent in writing is likely negligible (probably you're not writing much
output?)

Can you clarify the difference between these two?

> reads the parquet files from S3 and writes to its local hdfs, it takes
the same amount of time (±22 min)
> reads the parquet files from S3 (they were copied into the hdfs before)
and writes to its local hdfs, the job took 7 min

In the second case, was the data read from hdfs or s3?

Regarding the point from the post you linked to:
1. Enhanced networking does make a difference
<https://laptrinhx.com/hadoop-with-enhanced-networking-on-aws-1893465489/>,
but it should be enabled automatically if you're using a compatible
instance type and an AWS AMI. However, if you're using a custom AMI, you
might want to check whether it's enabled for you.
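If you want to verify, here is a quick sketch (the instance ID is a
placeholder; the commands assume a Linux instance and a configured AWS CLI):

```shell
# On the instance itself: the "ena" driver indicates enhanced networking.
ethtool -i eth0 | grep '^driver'

# Or from outside, via the AWS CLI (instance ID is a placeholder):
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].EnaSupport'
```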
2. VPC endpoints can also make a difference in performance; at least that
used to be the case a few years ago. Maybe that has changed by now.
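To rule this out, you can check whether your VPC already has an S3 gateway
endpoint, and add one if not. A sketch with the AWS CLI (the VPC ID, route
table ID, and region are placeholders):

```shell
# Check for an existing S3 endpoint (placeholder VPC ID and region):
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
            Name=service-name,Values=com.amazonaws.us-east-1.s3

# If none exists, create a gateway endpoint (placeholder route table ID):
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0
```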

A couple of other things you might want to check:
1. If your bucket is versioned, you may want to check whether you're using
the ListObjectsV2 API in S3A
<https://issues.apache.org/jira/browse/HADOOP-13421>.
2. Also check these recommendations from Cloudera
<https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/s3-performance.html>
for optimal use of S3A.
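For reference, these kinds of settings usually go in spark-defaults.conf (or
equivalent --conf flags). A sketch with illustrative values only; please
verify each key against the S3A documentation for your Hadoop version:

```
# Illustrative S3A tuning; values are examples, not recommendations.
# Use ListObjectsV2 (HADOOP-13421):
spark.hadoop.fs.s3a.list.version               2
# Random-access reads suit columnar formats like Parquet:
spark.hadoop.fs.s3a.experimental.input.fadvise random
spark.hadoop.fs.s3a.connection.maximum         200
spark.hadoop.fs.s3a.threads.max                64
spark.hadoop.fs.s3a.fast.upload                true
```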

Thanks,
Hariharan



On Wed, Apr 7, 2021 at 12:15 AM Tzahi File <tzahi.file@ironsrc.com> wrote:

> Hi All,
>
> We have a Spark cluster on AWS EC2 with 60 i3.4xlarge instances.
>
> The spark job running on that cluster reads from an S3 bucket and writes
> to that bucket.
>
> The bucket and the EC2 instances run in the same region.
>
> As part of our efforts to reduce the runtime of our spark jobs we found
> there's serious latency when reading from S3.
>
> When the job:
>
>    - reads the parquet files from S3 and also writes to S3, it takes 22
>    min
>    - reads the parquet files from S3 and writes to its local hdfs, it
>    takes the same amount of time (±22 min)
>    - reads the parquet files from S3 (they were copied into the hdfs
>    before) and writes to its local hdfs, the job took 7 min
>
> The Spark job has the following S3-related configuration:
>
>    - spark.hadoop.fs.s3a.connection.establish.timeout=5000
>    - spark.hadoop.fs.s3a.connection.maximum=200
>
> When reading from S3, we tried increasing the
> spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 or 900,
> but it didn't reduce the S3 latency.
>
> Do you have any idea for the cause of the read latency from S3?
>
> I saw this post
> <https://aws.amazon.com/premiumsupport/knowledge-center/s3-transfer-data-bucket-instance/>
> on improving transfer speed. Is anything there relevant?
>
>
> Thanks,
> Tzahi
>
