spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Very high latency to initialize a DataFrame from partitioned parquet database.
Date Thu, 06 Aug 2015 04:07:25 GMT
We've fixed this issue in 1.5: https://github.com/apache/spark/pull/7396

Could you give it a shot to see whether it helps in your case? We've 
observed a ~50x performance boost with schema merging turned on.
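
For reference, schema merging can also be toggled per read. A minimal 
sketch, assuming the "mergeSchema" option of the Parquet data source 
in 1.5 ("asdf" is just the example path from your mail below):

    // Merge per-partition schemas (the code path sped up by the PR above):
    val merged = sqlContext.read.option("mergeSchema", "true").parquet("asdf")

    // Or skip merging entirely when all partitions share one schema:
    val fast = sqlContext.read.option("mergeSchema", "false").parquet("asdf")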

Cheng

On 8/6/15 8:26 AM, Philip Weaver wrote:
> I have a parquet directory that was produced by partitioning by two 
> keys, e.g. like this:
>
>     df.write.partitionBy("a", "b").parquet("asdf")
>
>
> There are 35 values of "a", and about 1100-1200 values of "b" for each 
> value of "a", for a total of over 40,000 partitions.
>
> Before running any transformations or actions on the DataFrame, just 
> initializing it like this takes *2 minutes*:
>
>     val df = sqlContext.read.parquet("asdf")
>
>
> Is this normal? Is this because it is doing some bookkeeping to 
> discover all the partitions? Is it perhaps having to merge the schema 
> from each partition? Would you expect it to get better or worse if I 
> subpartition by another key?
>
> - Philip
>
>

