spark-user mailing list archives

From Matthew Anthony <statm...@gmail.com>
Subject Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0
Date Mon, 11 Sep 2017 22:19:21 GMT
Any other feedback on this?
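
Re Neil's question below: a minimal sketch of the kind of call involved
(the bucket and path are placeholders rather than our real ones):

    from pyspark.sql import SparkSession

    # Hive support mirrors our EMR setup (Hive metastore-backed catalog).
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # A plain read of a partitioned parquet dataset on S3; the
    # multi-minute "idle" period described below happens before this
    # call returns a DataFrame.
    df = spark.read.parquet("s3://some-bucket/path/to/table")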


On 9/8/17 11:00 AM, Neil Jonkers wrote:
> Can you provide a code sample please?
>
> On Fri, Sep 8, 2017 at 5:44 PM, Matthew Anthony <statmatt@gmail.com> wrote:
>
>     Hi all -
>
>
>     Since upgrading to 2.2.0, we've noticed a significant increase in
>     the time taken by read.parquet(...) operations. The parquet files
>     are being read from S3. When the call is entered at the interactive
>     terminal (pyspark in this case), the terminal will sit "idle" for
>     several minutes (as many as 10) before returning:
>
>
>     "17/09/08 15:34:37 WARN SharedInMemoryCache: Evicting cached table
>     partition metadata from memory due to size constraints
>     (spark.sql.hive.filesourcePartitionFileCacheSize = 2000000000
>     bytes). This may impact query planning performance."
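>
>     If the cache is simply too small for the table's partition
>     metadata, presumably it could be raised when the session is built,
>     along these lines (the figure below is an arbitrary example, not a
>     recommendation):
>
>         from pyspark.sql import SparkSession
>
>         spark = (SparkSession.builder
>                  .enableHiveSupport()
>                  # double the ~2 GB limit the warning reports
>                  .config("spark.sql.hive.filesourcePartitionFileCacheSize",
>                          "4000000000")
>                  .getOrCreate())
>
>     That said, we'd like to understand what the long idle period is
>     actually doing before turning knobs.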
>
>
>     In the Spark UI, no jobs run during this idle period. A short
>     one-task job lasting approximately 10 seconds then runs, followed
>     by another idle stretch of roughly 2-3 minutes before control
>     returns to the terminal/CLI.
>
>
>     Can someone explain what is happening here in the background? Is
>     there a misconfiguration we should be looking for? We are using
>     the Hive metastore on the EMR cluster.
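>
>     In case it helps, the settings that look relevant here (as far as
>     we can tell from the docs) can be inspected from pyspark:
>
>         # Resolved values (defaults apply if we never set them).
>         spark.conf.get("spark.sql.hive.manageFilesourcePartitions")
>         spark.conf.get("spark.sql.hive.filesourcePartitionFileCacheSize")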
>
>

