spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jerrick Hoang <jerrickho...@gmail.com>
Subject Re: Spark Sql behaves strangely with tables with a lot of partitions
Date Thu, 20 Aug 2015 04:16:12 GMT
I guess the question is why does spark have to do partition discovery with
all partitions when the query only needs to look at one partition? Is there
a conf flag to turn this off?

On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver <philip.weaver@gmail.com>
wrote:

> I've had the same problem. It turns out that Spark (specifically parquet)
> is very slow at partition discovery. It got better in 1.5 (not yet
> released), but was still unacceptably slow. Sadly, we ended up reading
> parquet files manually in Python (via C++) and had to abandon Spark SQL
> because of this problem.
>
> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang <jerrickhoang@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I did a simple experiment with Spark SQL. I created a partitioned parquet
>> table with only one partition (date=20140701). A simple `select count(*)
>> from table where date=20140701` would run very fast (0.1 seconds). However,
>> as I added more partitions the query takes longer and longer. When I added
>> about 10,000 partitions, the query took way too long. I feel like querying
>> for a single partition should not be affected by having more partitions. Is
>> this a known behaviour? What does spark try to do here?
>>
>> Thanks,
>> Jerrick
>>
>
>

Mime
View raw message