spark-user mailing list archives

From Jerrick Hoang <jerrickho...@gmail.com>
Subject Re: Spark Sql behaves strangely with tables with a lot of partitions
Date Sun, 23 Aug 2015 09:00:00 GMT
Does anybody have any suggestions?

On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang <jerrickhoang@gmail.com>
wrote:

> Is there a workaround that doesn't require updating Hadoop? I would really
> appreciate it if someone could explain what Spark is trying to do here and
> whether there is an easy way to turn this off. Thanks all!
>
> On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey <
> raghavendra.pandey@gmail.com> wrote:
>
>> Did you try with Hadoop version 2.7.1? It is known that s3a, which is
>> available in 2.7, works really well with Parquet. They fixed a lot of
>> issues related to metadata reading there.
>> On Aug 21, 2015 11:24 PM, "Jerrick Hoang" <jerrickhoang@gmail.com> wrote:
>>
>>> @Cheng, Hao : Physical plans show that it got stuck on scanning S3!
>>>
>>> (table is partitioned by date_prefix and hour)
>>> explain select count(*) from test_table where date_prefix='20150819' and
>>> hour='00';
>>>
>>> TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
>>>  TungstenExchange SinglePartition
>>>   TungstenAggregate(key=[], value=[(count(1),mode=Partial,isDistinct=false)]
>>>    Scan ParquetRelation[ .. <about 1000 partition paths go here> ]
>>>
>>> Why does Spark have to scan all partitions when the query only concerns
>>> one partition? Doesn't that defeat the purpose of partitioning?
>>>
>>> Thanks!
>>>
>>> On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver <philip.weaver@gmail.com>
>>> wrote:
>>>
>>>> I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before,
>>>> and I couldn't find much information about it online. What does it mean
>>>> exactly to disable it? Are there any negative consequences to disabling it?
>>>>
>>>> On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao <hao.cheng@intel.com>
>>>> wrote:
>>>>
>>>>> Can you do some more profiling? I am wondering if the driver is busy
>>>>> scanning HDFS / S3.
>>>>>
>>>>> For example: jstack <pid of driver process>
>>>>>
>>>>>
>>>>>
>>>>> Also, it would be great if you could paste the physical plan for
>>>>> the simple query.
>>>>>
>>>>>
>>>>>
>>>>> *From:* Jerrick Hoang [mailto:jerrickhoang@gmail.com]
>>>>> *Sent:* Thursday, August 20, 2015 1:46 PM
>>>>> *To:* Cheng, Hao
>>>>> *Cc:* Philip Weaver; user
>>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
>>>>> partitions
>>>>>
>>>>>
>>>>>
>>>>> I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple
>>>>> of CLs trying to speed up Spark SQL on tables with a huge number of
>>>>> partitions. I've made sure that those CLs are included, but it's still
>>>>> very slow.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao <hao.cheng@intel.com>
>>>>> wrote:
>>>>>
>>>>> Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled
>>>>> to false.
>>>>>
>>>>>
>>>>>
>>>>> BTW, which version are you using?
>>>>>
>>>>>
>>>>>
>>>>> Hao
>>>>>
>>>>>
>>>>>
>>>>> *From:* Jerrick Hoang [mailto:jerrickhoang@gmail.com]
>>>>> *Sent:* Thursday, August 20, 2015 12:16 PM
>>>>> *To:* Philip Weaver
>>>>> *Cc:* user
>>>>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of
>>>>> partitions
>>>>>
>>>>>
>>>>>
>>>>> I guess the question is: why does Spark have to do partition discovery
>>>>> for all partitions when the query only needs to look at one partition?
>>>>> Is there a conf flag to turn this off?
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver <
>>>>> philip.weaver@gmail.com> wrote:
>>>>>
>>>>> I've had the same problem. It turns out that Spark (specifically
>>>>> parquet) is very slow at partition discovery. It got better in 1.5 (not
>>>>> yet released), but was still unacceptably slow. Sadly, we ended up reading
>>>>> parquet files manually in Python (via C++) and had to abandon Spark SQL
>>>>> because of this problem.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang <jerrickhoang@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>>
>>>>>
>>>>> I did a simple experiment with Spark SQL. I created a partitioned
>>>>> parquet table with only one partition (date=20140701). A simple `select
>>>>> count(*) from table where date=20140701` would run very fast (0.1 seconds).
>>>>> However, as I added more partitions, the query took longer and longer.
>>>>> When I added about 10,000 partitions, the query took way too long. I feel
>>>>> like querying a single partition should not be affected by having more
>>>>> partitions. Is this a known behaviour? What is Spark trying to do here?
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jerrick
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>
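
A footnote to the thread, not something any poster wrote: reading a single partition without any discovery step, in the spirit of the manual approach Philip mentions, requires computing that partition's directory from its column values. The Python below is a minimal sketch of that one step. It assumes a Hive-style `key=value` directory layout, which the thread suggests but does not confirm. The helper name `partition_dir` and the base path `s3a://example-bucket/test_table` are hypothetical.

```python
def partition_dir(base_path, partition_values):
    """Return the directory holding one partition of a Hive-style table.

    partition_values is an ordered list of (column, value) pairs, outermost
    partition column first. No directory listing is performed, so the cost
    does not depend on how many other partitions exist.
    """
    parts = [base_path.rstrip("/")]
    for column, value in partition_values:
        parts.append("{}={}".format(column, value))
    return "/".join(parts)


# Hypothetical base path; the partition columns match the query in the thread.
example = partition_dir(
    "s3a://example-bucket/test_table",
    [("date_prefix", "20150819"), ("hour", "00")],
)
print(example)
```

The resulting path can be handed to whatever Parquet reader is in use. Whether Spark itself skips discovery when given such a path depends on the version and is not covered by this sketch.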
