spark-dev mailing list archives

From "Kazuaki Ishizaki" <>
Subject Re: Why are DataFrames always read with nullable=True?
Date Tue, 21 Mar 2017 01:57:51 GMT
Regarding the read side of nullability, adding a data cleaning step seems 
to be under consideration, as Xiao said at

Here is a PR that adds the data cleaning step; it throws an exception if a 
null exists in a non-nullable column.
Any comments are appreciated.
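
To make the idea of such a cleaning step concrete, here is a minimal plain-Python sketch of the check (reject any null found in a column declared non-nullable). It is not the code in the PR and does not use Spark's API; the function name `check_non_nullable` and the `(name, nullable)` field tuples are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical schema representation: (column name, nullable flag).
Field = Tuple[str, bool]


def check_non_nullable(rows: Iterable[Dict[str, Any]],
                       fields: List[Field]) -> List[Dict[str, Any]]:
    """Return the rows unchanged, or raise if a non-nullable column holds None."""
    required = [name for name, nullable in fields if not nullable]
    checked = []
    for i, row in enumerate(rows):
        for name in required:
            # A missing key is treated the same as an explicit null.
            if row.get(name) is None:
                raise ValueError(
                    f"null in non-nullable column {name!r} at row {i}")
        checked.append(row)
    return checked


fields = [("id", False), ("comment", True)]
clean = check_non_nullable([{"id": 1, "comment": None}], fields)
```

In this sketch, a null in `comment` passes because that column is nullable, while a row such as `{"id": None}` would raise a `ValueError`.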

Kazuaki Ishizaki

From:   Jason White <>
Date:   2017/03/21 06:31
Subject:        Why are DataFrames always read with nullable=True?

If I create a dataframe in Spark with non-nullable columns, and then save
that to disk as a Parquet file, the columns are properly marked as
non-nullable. I confirmed this using parquet-tools. Then, when loading it
back, Spark forces the nullable back to True.

If I remove the `.asNullable` part, Spark performs exactly as I'd like by
default, picking up the data using the schema either in the Parquet file 
or provided by me.

This particular LoC goes back a year now, and I've seen a variety of
discussions about this issue. In particular with Michael here: Those
seemed to be discussing writing, not reading, though, and writing is 
supported now.

Is this functionality still desirable? Is it potentially not applicable 
to all file formats and situations (e.g. HDFS/Parquet)? Would it be 
suitable to pass an option to the DataFrameReader to disable this 
functionality?
