spark-user mailing list archives

From: Yong Zhang <java8...@hotmail.com>
Subject: Re: How to read the schema of a partitioned dataframe without listing all the partitions?
Date: Fri, 27 Apr 2018 14:07:48 GMT
What version of Spark are you using?


You can search for "spark.sql.parquet.mergeSchema" on https://spark.apache.org/docs/latest/sql-programming-guide.html


Starting with Spark 1.5, the default is already "false", which means Spark shouldn't scan
all the Parquet files to generate the schema.
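
For example, here is a minimal Scala sketch of keeping schema merging off explicitly
at read time (the S3 path is a placeholder):

    // Session-wide default for Parquet schema merging.
    spark.conf.set("spark.sql.parquet.mergeSchema", "false")

    val df = spark.read
      .option("mergeSchema", "false")  // per-read override of the same setting
      .parquet("s3://bucket/table")    // placeholder path

    // With merging off, Spark picks the schema from a summary file or a
    // single data file instead of reading every footer.
    df.printSchema()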


Yong

________________________________
From: Walid LEZZAR <walezz89@gmail.com>
Sent: Friday, April 27, 2018 7:42 AM
To: spark users
Subject: How to read the schema of a partitioned dataframe without listing all the partitions?

Hi,

I have a Parquet dataset on S3 partitioned by day, with 2 years of data (about 1000 partitions).
With Spark, when I just want to know the schema of this dataset, without even asking for a
single row of data, Spark tries to list all the partitions and nested partitions of the
Parquet, which makes it very slow just to build the dataframe object in Zeppelin.

Is there a way to avoid that? Is there a way to tell Spark: "hey, just read a single partition,
give me the schema of that partition, and consider it as the schema of the whole dataframe"?
(I don't care about schema merging; it's off, by the way.)
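
To make the idea concrete, here is a rough Scala sketch of what I mean (the paths and the
"day" partition column name are just examples):

    import org.apache.spark.sql.types.StringType

    // Read one leaf directory only, so Spark lists just that partition.
    val onePartition = spark.read.parquet("s3://bucket/table/day=2018-04-27")

    // Reading a leaf directory directly drops the partition column, so it
    // has to be added back before reusing the schema for the whole table.
    val fullSchema = onePartition.schema.add("day", StringType)

    // Supplying the schema up front lets Spark skip schema inference,
    // though partition discovery may still list directories.
    val df = spark.read.schema(fullSchema).parquet("s3://bucket/table")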

Thanks.
Walid.
