spark-user mailing list archives

From Philip Weaver <philip.wea...@gmail.com>
Subject Very high latency to initialize a DataFrame from partitioned parquet database.
Date Thu, 06 Aug 2015 00:26:40 GMT
I have a parquet directory that was produced by partitioning by two keys,
e.g. like this:

df.write.partitionBy("a", "b").parquet("asdf")


There are 35 values of "a", and about 1100-1200 values of "b" for each
value of "a", for a total of over 40,000 partitions.
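For context, `partitionBy` writes one leaf directory per key combination, so the resulting layout looks roughly like this (file and key names here are illustrative):

```
asdf/
  a=1/
    b=1/part-00000.parquet
    b=2/part-00000.parquet
    ...
  a=2/
    ...
```

With 35 values of "a" and ~1150 values of "b" each, that is on the order of 40,000 leaf directories Spark must enumerate (and potentially read a Parquet footer from) just to discover the partitions.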

Before running any transformations or actions on the DataFrame, just
initializing it like this takes *2 minutes*:

val df = sqlContext.read.parquet("asdf")


Is this normal? Is this because it is doing some bookkeeping to discover all
the partitions? Is it perhaps having to merge the schema from each
partition? Would you expect it to get better or worse if I subpartition by
another key?
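If schema merging turns out to be the cause, one thing worth trying is disabling it on read. This is a sketch assuming a Spark version where the "mergeSchema" read option is available (it can also be set globally via the `spark.sql.parquet.mergeSchema` conf); with merging off, Spark takes the schema from a single footer rather than reconciling one per partition:

```scala
// Sketch: disable Parquet schema merging on read (assumes the
// "mergeSchema" read option exists in this Spark version).
val df = sqlContext.read
  .option("mergeSchema", "false")
  .parquet("asdf")
```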

- Philip
