spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cheng, Hao" <hao.ch...@intel.com>
Subject RE: Spark Sql behaves strangely with tables with a lot of partitions
Date Thu, 20 Aug 2015 05:53:14 GMT
Can you make some more profiling? I am wondering if the driver is busy with scanning the HDFS
/ S3.
Like jstack <pid of driver process>

And also, it’s will be great if you can paste the physical plan for the simple query.

From: Jerrick Hoang [mailto:jerrickhoang@gmail.com]
Sent: Thursday, August 20, 2015 1:46 PM
To: Cheng, Hao
Cc: Philip Weaver; user
Subject: Re: Spark Sql behaves strangely with tables with a lot of partitions

I cloned from TOT after 1.5.0 cut off. I noticed there were a couple of CLs trying to speed
up spark sql with tables with a huge number of partitions, I've made sure that those CLs are
included but it's still very slow

On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao <hao.cheng@intel.com<mailto:hao.cheng@intel.com>>
wrote:
Yes, you can try set the spark.sql.sources.partitionDiscovery.enabled to false.

BTW, which version are you using?

Hao

From: Jerrick Hoang [mailto:jerrickhoang@gmail.com<mailto:jerrickhoang@gmail.com>]
Sent: Thursday, August 20, 2015 12:16 PM
To: Philip Weaver
Cc: user
Subject: Re: Spark Sql behaves strangely with tables with a lot of partitions

I guess the question is why does spark have to do partition discovery with all partitions
when the query only needs to look at one partition? Is there a conf flag to turn this off?

On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver <philip.weaver@gmail.com<mailto:philip.weaver@gmail.com>>
wrote:
I've had the same problem. It turns out that Spark (specifically parquet) is very slow at
partition discovery. It got better in 1.5 (not yet released), but was still unacceptably slow.
Sadly, we ended up reading parquet files manually in Python (via C++) and had to abandon Spark
SQL because of this problem.

On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang <jerrickhoang@gmail.com<mailto:jerrickhoang@gmail.com>>
wrote:
Hi all,

I did a simple experiment with Spark SQL. I created a partitioned parquet table with only
one partition (date=20140701). A simple `select count(*) from table where date=20140701` would
run very fast (0.1 seconds). However, as I added more partitions the query takes longer and
longer. When I added about 10,000 partitions, the query took way too long. I feel like querying
for a single partition should not be affected by having more partitions. Is this a known behaviour?
What does spark try to do here?

Thanks,
Jerrick



Mime
View raw message