spark-user mailing list archives

From: Neil Jonkers <neilod...@gmail.com>
Subject: Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0
Date: Fri, 08 Sep 2017 17:00:09 GMT
Can you provide a code sample please?

On Fri, Sep 8, 2017 at 5:44 PM, Matthew Anthony <statmatt@gmail.com> wrote:

> Hi all -
>
>
> Since upgrading to 2.2.0, we've noticed a significant increase in the
> time taken by read.parquet(...) operations. The parquet files are being
> read from S3. After the command is entered at the interactive terminal
> (pyspark in this case), the session sits "idle" for several minutes (as
> many as 10) before returning:
>
>
> "17/09/08 15:34:37 WARN SharedInMemoryCache: Evicting cached table
> partition metadata from memory due to size constraints
> (spark.sql.hive.filesourcePartitionFileCacheSize = 2000000000 bytes).
> This may impact query planning performance."
>
>
> In the Spark UI, no jobs run during this idle period. A short one-task
> job then runs for approximately 10 seconds, followed by another idle
> period of roughly 2-3 minutes before control returns to the
> terminal/CLI.
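>
> For reference, the read itself is just a plain parquet load; a minimal
> sketch from the pyspark shell (the S3 path below is a placeholder, not
> our real bucket):
>
>     # pyspark shell, where `spark` is the pre-built SparkSession;
>     # the path is illustrative
>     df = spark.read.parquet("s3://our-bucket/path/to/table/")
>     df.limit(5).show()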
>
>
> Can someone explain what is happening in the background here? Is there a
> misconfiguration we should be looking for? We are using the Hive
> metastore on the EMR cluster.
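>
> In case it's relevant: a sketch of how the cache cap named in the
> warning could be raised when building the session (the 4 GB figure is
> only an example, not a tested value):
>
>     from pyspark.sql import SparkSession
>
>     # spark.sql.hive.filesourcePartitionFileCacheSize is in bytes; the
>     # warning fires once cached partition file metadata exceeds it.
>     spark = (
>         SparkSession.builder
>         .config("spark.sql.hive.filesourcePartitionFileCacheSize",
>                 str(4 * 1024 ** 3))
>         .enableHiveSupport()
>         .getOrCreate()
>     )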
