spark-user mailing list archives

From Philip Weaver <philip.wea...@gmail.com>
Subject Re: Very high latency to initialize a DataFrame from partitioned parquet database.
Date Thu, 06 Aug 2015 07:58:59 GMT
I built Spark from the v1.5.0-snapshot-20150803 tag in the repo and tried
again.

The initialization time is about 1 minute now, which is still pretty
terrible.
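
A minimal sketch of how the initialization time above could be measured, assuming a Spark 1.5-era build on the classpath and the "asdf" path from the original message quoted below. The `ParquetInitTiming` object and the `timed` helper are illustrative names, not from this thread. The `mergeSchema` read option is the documented Spark 1.5 switch for Parquet schema merging; whether setting it changes the 1 minute figure is not something this thread establishes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetInitTiming {
  // Hypothetical helper: evaluates `body` once and returns its result
  // together with the elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // Fall back to a local master only if none was supplied (e.g. by spark-submit).
    val conf = new SparkConf()
      .setAppName("parquet-init-timing")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Only DataFrame construction is timed; no transformation or action runs.
    // Schema merging is explicitly disabled here to isolate partition discovery.
    val (df, millis) = timed {
      sqlContext.read.option("mergeSchema", "false").parquet("asdf")
    }
    println(s"Initialized DataFrame with ${df.columns.length} columns in $millis ms")

    sc.stop()
  }
}
```

Running the same program with `"mergeSchema"` set to `"true"` would give a second number to compare against, which should indicate how much of the startup cost is schema merging as opposed to listing the partition directories.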

On Wed, Aug 5, 2015 at 9:08 PM, Philip Weaver <philip.weaver@gmail.com>
wrote:

> Absolutely, thanks!
>
> On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>> We've fixed this issue in 1.5: https://github.com/apache/spark/pull/7396
>>
>> Could you give it a shot to see whether it helps in your case? We've
>> observed a ~50x performance boost with schema merging turned on.
>>
>> Cheng
>>
>>
>> On 8/6/15 8:26 AM, Philip Weaver wrote:
>>
>> I have a parquet directory that was produced by partitioning by two keys,
>> e.g. like this:
>>
>> df.write.partitionBy("a", "b").parquet("asdf")
>>
>>
>> There are 35 values of "a", and about 1100-1200 values of "b" for each
>> value of "a", for a total of over 40,000 partitions.
>>
>> Before running any transformations or actions on the DataFrame, just
>> initializing it like this takes *2 minutes*:
>>
>> val df = sqlContext.read.parquet("asdf")
>>
>>
>> Is this normal? Is this because it is doing some bookkeeping to discover
>> all the partitions? Is it perhaps having to merge the schema from each
>> partition? Would you expect it to get better or worse if I subpartition by
>> another key?
>>
>> - Philip
>>
>>
>>
>>
>
