spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Philip Weaver <philip.wea...@gmail.com>
Subject Re: Very high latency to initialize a DataFrame from partitioned parquet database.
Date Thu, 06 Aug 2015 04:08:00 GMT
Absolutely, thanks!

On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:

> We've fixed this issue in 1.5 https://github.com/apache/spark/pull/7396
>
> Could you give it a shot to see whether it helps in your case? We've
> observed ~50x performance boost with schema merging turned on.
>
> Cheng
>
>
> On 8/6/15 8:26 AM, Philip Weaver wrote:
>
> I have a parquet directory that was produced by partitioning by two keys,
> e.g. like this:
>
> df.write.partitionBy("a", "b").parquet("asdf")
>
>
> There are 35 values of "a", and about 1100-1200 values of "b" for each
> value of "a", for a total of over 40,000 partitions.
>
> Before running any transformations or actions on the DataFrame, just
> initializing it like this takes *2 minutes*:
>
> val df = sqlContext.read.parquet("asdf")
>
>
> Is this normal? Is this because it is doing some bookeeping to discover
> all the partitions? Is it perhaps having to merge the schema from each
> partition? Would you expect it to get better or worse if I subpartition by
> another key?
>
> - Philip
>
>
>
>

Mime
View raw message