spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jerrick Hoang <jerrickho...@gmail.com>
Subject Re: Spark Sql behaves strangely with tables with a lot of partitions
Date Fri, 21 Aug 2015 17:54:03 GMT
@Cheng, Hao : Physical plans show that it got stuck on scanning S3!

(table is partitioned by date_prefix and hour)
explain select count(*) from test_table where date_prefix='20150819' and
hour='00';

TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], value=[(count(1),mode=Partial,isDistinct=false)]
   Scan ParquetRelation[ .. <about 1000 partition paths go here> ]

Why does spark have to scan all partitions when the query only concerns
with 1 partitions? Doesn't it defeat the purpose of partitioning?

Thanks!

On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver <philip.weaver@gmail.com>
wrote:

> I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before, and
> I couldn't find much information about it online. What does it mean exactly
> to disable it? Are there any negative consequences to disabling it?
>
> On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao <hao.cheng@intel.com> wrote:
>
>> Can you make some more profiling? I am wondering if the driver is busy
>> with scanning the HDFS / S3.
>>
>> Like jstack <pid of driver process>
>>
>>
>>
>> And also, it’s will be great if you can paste the physical plan for the
>> simple query.
>>
>>
>>
>> *From:* Jerrick Hoang [mailto:jerrickhoang@gmail.com]
>> *Sent:* Thursday, August 20, 2015 1:46 PM
>> *To:* Cheng, Hao
>> *Cc:* Philip Weaver; user
>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
>> partitions
>>
>>
>>
>> I cloned from TOT after 1.5.0 cut off. I noticed there were a couple of
>> CLs trying to speed up spark sql with tables with a huge number of
>> partitions, I've made sure that those CLs are included but it's still very
>> slow
>>
>>
>>
>> On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao <hao.cheng@intel.com> wrote:
>>
>> Yes, you can try set the spark.sql.sources.partitionDiscovery.enabled to
>> false.
>>
>>
>>
>> BTW, which version are you using?
>>
>>
>>
>> Hao
>>
>>
>>
>> *From:* Jerrick Hoang [mailto:jerrickhoang@gmail.com]
>> *Sent:* Thursday, August 20, 2015 12:16 PM
>> *To:* Philip Weaver
>> *Cc:* user
>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
>> partitions
>>
>>
>>
>> I guess the question is why does spark have to do partition discovery
>> with all partitions when the query only needs to look at one partition? Is
>> there a conf flag to turn this off?
>>
>>
>>
>> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver <philip.weaver@gmail.com>
>> wrote:
>>
>> I've had the same problem. It turns out that Spark (specifically parquet)
>> is very slow at partition discovery. It got better in 1.5 (not yet
>> released), but was still unacceptably slow. Sadly, we ended up reading
>> parquet files manually in Python (via C++) and had to abandon Spark SQL
>> because of this problem.
>>
>>
>>
>> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang <jerrickhoang@gmail.com>
>> wrote:
>>
>> Hi all,
>>
>>
>>
>> I did a simple experiment with Spark SQL. I created a partitioned parquet
>> table with only one partition (date=20140701). A simple `select count(*)
>> from table where date=20140701` would run very fast (0.1 seconds). However,
>> as I added more partitions the query takes longer and longer. When I added
>> about 10,000 partitions, the query took way too long. I feel like querying
>> for a single partition should not be affected by having more partitions. Is
>> this a known behaviour? What does spark try to do here?
>>
>>
>>
>> Thanks,
>>
>> Jerrick
>>
>>
>>
>>
>>
>>
>>
>
>

Mime
View raw message