spark-user mailing list archives

From Jerrick Hoang <jerrickho...@gmail.com>
Subject Spark Sql behaves strangely with tables with a lot of partitions
Date Thu, 20 Aug 2015 02:51:50 GMT
Hi all,

I did a simple experiment with Spark SQL. I created a partitioned parquet
table with only one partition (date=20140701). A simple `select count(*)
from table where date=20140701` ran very fast (0.1 seconds). However, as I
added more partitions, the query took longer and longer; with about 10,000
partitions it took far too long. Querying a single partition should not, I
would think, be affected by the total number of partitions in the table. Is
this a known behaviour? What does Spark try to do here?
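
For concreteness, this is roughly the shape of the setup (simplified; the
path, table name, and values below are just placeholders, and I'm running
from spark-shell where sqlContext is predefined):

    import sqlContext.implicits._

    // Write a parquet table partitioned by `date`; each distinct value
    // becomes one date=... directory on disk.
    Seq((1L, 20140701), (2L, 20140702))
      .toDF("id", "date")
      .write
      .partitionBy("date")
      .parquet("/tmp/partitioned_table")

    // Point Spark SQL at it and count the rows in a single partition.
    sqlContext.read.parquet("/tmp/partitioned_table").registerTempTable("t")
    sqlContext.sql("select count(*) from t where date = 20140701").show()

The real table is built the same way, just with thousands of date=...
directories.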

Thanks,
Jerrick
