spark-user mailing list archives

From Walid LEZZAR <walez...@gmail.com>
Subject How to read the schema of a partitioned dataframe without listing all the partitions ?
Date Fri, 27 Apr 2018 11:42:06 GMT
Hi,

I have a parquet dataset on S3 partitioned by day, covering about 2 years of data
(roughly 1000 partitions). With Spark, when I just want to know the schema of this
parquet without asking for a single row of data, Spark lists all the partitions and
nested partitions of the dataset, which makes merely building the dataframe object
on Zeppelin very slow.
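For context, this is all it takes to trigger the full listing (the bucket name and path below are made up):

    // `spark` here is the SparkSession that Zeppelin / spark-shell already provides
    val events = spark.read.parquet("s3://my-bucket/events")  // lists every day=... directory on S3
    events.printSchema()  // no rows requested, but the listing has already happened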

Is there a way to avoid that? Is there a way to tell Spark: "hey, just read a single
partition, give me the schema of that partition, and consider it the schema of the
whole dataframe"? (I don't care about schema merging; it's turned off anyway.)
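Something along these lines is what I have in mind (bucket, prefix and partition value are just placeholders, and I'm assuming reading one partition directory plus passing its schema explicitly is enough to skip schema inference over the other partitions):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // 1) Read a single, known partition directory: only that directory is listed,
    //    and only its parquet footers are read to infer the schema.
    val oneDaySchema = spark.read
      .parquet("s3://my-bucket/events/day=2018-04-26")
      .schema

    // 2) Re-use that schema for the whole table. The explicit schema skips footer-based
    //    schema inference; "basePath" keeps the partition column ("day") in the result.
    //    Note: the directory listing for partition discovery still happens when the
    //    full table path is read, so this only saves the schema-inference part.
    val df = spark.read
      .schema(oneDaySchema)
      .option("basePath", "s3://my-bucket/events")
      .parquet("s3://my-bucket/events")

    df.printSchema()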

Thanks.
Walid.
