spark-user mailing list archives

From Yann Moisan <yam...@gmail.com>
Subject [Spark SQL] [Spark 2.4.0] Performance regression when reading parquet files from S3
Date Wed, 14 Nov 2018 20:07:29 GMT
Hello,

A Spark job on EMR reads Parquet files located in an S3 bucket.

I use this option: spark.hadoop.fs.s3a.experimental.input.fadvise=random

When the EC2 instances and the bucket are in the same region, performance
is about the same as before Spark 2.4.0, but when they are not, performance
drops (job duration doubles).

Note: using the default value for the parameter mitigates the issue.

spark.hadoop.fs.s3a.experimental.input.fadvise=sequential
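
To compare the two settings side by side, the option can be passed per run with `--conf`. The following is only a minimal sketch of that idea: the application name `job.py` and the helper `spark_submit_command` are hypothetical, and only the two policy values mentioned in this message are accepted.

```python
import shlex

# Property name as used in this message.
FADVISE_KEY = "spark.hadoop.fs.s3a.experimental.input.fadvise"

# Only the two values discussed here; other values are deliberately rejected.
POLICIES = ("random", "sequential")


def spark_submit_command(policy, app="job.py"):
    """Build a spark-submit argument list that sets the S3A input policy.

    `app` is a placeholder for the real job script or jar (assumption).
    """
    if policy not in POLICIES:
        raise ValueError(f"unsupported policy: {policy!r}")
    return ["spark-submit", "--conf", f"{FADVISE_KEY}={policy}", app]


if __name__ == "__main__":
    # Print one command line per policy so both runs can be timed separately.
    for policy in POLICIES:
        print(shlex.join(spark_submit_command(policy)))
```

Running each printed command against the same input, once from the bucket's region and once from another region, should show whether the slowdown follows the `random` policy.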

Any idea what changed in Spark 2.4.0 that could explain this issue?
