spark-user mailing list archives

From Matthew Anthony
Subject [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0
Date Fri, 08 Sep 2017 15:44:47 GMT
Hi all -

Since upgrading to 2.2.0, we've noticed a significant increase in the
time taken by read.parquet(...) operations. The parquet files are being
read from S3. After the command is entered at the interactive terminal
(pyspark in this case), the terminal sits "idle" for several minutes
(as many as 10) before returning:

"17/09/08 15:34:37 WARN SharedInMemoryCache: Evicting cached table 
partition metadata from memory due to size constraints 
(spark.sql.hive.filesourcePartitionFileCacheSize = 2000000000 bytes). 
This may impact query planning performance."

In the Spark UI, no jobs are running during this idle period. A short
single-task job lasting approximately 10 seconds then runs, followed by
another idle period of roughly 2-3 minutes before control returns to
the terminal/CLI.

Can someone explain what is happening here in the background? Is there a 
misconfiguration we should be looking for? We are using Hive metastore 
on the EMR cluster.
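
For anyone trying to reproduce this, a minimal sketch of how the
wall-clock time of the call could be measured and compared against the
job durations shown in the Spark UI. The `timed` helper and the
`timings` dict are illustrative names, not part of any Spark API, and
the `time.sleep` call is only a stand-in for the real read so that the
snippet runs without a cluster:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in for the real call. In a pyspark shell this would be something like
#   df = spark.read.parquet(path)   # `path` is a hypothetical S3 location
with timed("read.parquet", timings):
    time.sleep(0.05)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.2f}s")
```

If the measured time is far larger than the sum of the job durations in
the UI, the difference is time during which no job was running, which
is the idle period described above.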
