spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aleksander Eskilson <>
Subject Re: DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable
Date Fri, 14 Oct 2016 15:10:52 GMT
I've opened a Jira to the issue you requested earlier:

On Fri, Oct 14, 2016 at 9:24 AM Aleksander Eskilson <>

> Interesting. I'm quite glad to read your explanation, it makes some of our
> work quite a bit more clear. I'll open a ticket in a similar vein to this
> discussion:, contrasting
> nullability implementation as optimization and and enforcement.
> Additionally, shall I go ahead and open a ticket pointing out the missing
> call to .asNullable in the streaming reader?
> Thanks,
> Alek
> On Thu, Oct 13, 2016 at 4:05 PM Michael Armbrust <>
> wrote:
> There is a <> lot
> <> of
> <> confusion
> <> around
> <> nullable
> <> in
> <> StructType
> <> and we should
> definitly come up with a consistent story and make sure we have better
> documentation.  This might even mean deprecating this field.  At a high
> level, I think the key problem is that internally, nullable is used as an
> optimization, not an enforcement mechanism.  This is a lot different than
> NOT NULL in a traditional database. Specifically, we take "nulllable =
> false" as a promise that that column will never be null and use that fact
> to do optimizations like skipping null checks.  This means that if you lie
> to us, you actually can get wrong answers (i.e. 0 instead of null).  This
> is clearly confusing and not ideal.
> A little bit of explanation for some of the specific cases you brought up:
> the reason that we call asNullable on file sources is that even if that
> column is never null in the data, there are cases that can still produce a
> null result.  For example, when parsing JSON, we null out all columns other
> than _corrupt_record when we fail to parse a line.  The fact that its
> different in streaming is a bug.
> Would you mind opening up a JIRA ticket, and we discuss the right path
> forward there?
> On Thu, Oct 13, 2016 at 12:35 PM, Aleksander Eskilson <
>> wrote:
> Hi there,
> Working in the space of custom Encoders/ExpressionEncoders, I've noticed
> that the StructType schema as set when creating an object of the
> ExpressionEncoder[T] class [1] is not the schema actually used to set types
> for the columns of a Dataset, as created by using the .as(encoder) method
> [2] on read data. Instead, what occurs is that the schema is either
> inferred through analysis of the data, or a schema can be provided using
> the .schema(structType) method [3] of the DataFrameReader. However, when
> using the .schema(..) method of DataFrameReader, potentially undesirable
> behaviour occurs: while the DataSource is being resolved, all FieldTypes of
> the a StructType schema have their nullability set to *true* (using the
> asNullable function of StructTypes) [4] when the data is read from a local
> file, as opposed to a non-streaming source.
> Of course, allowing null-values where they shouldn't exist can weaken the
> type-guarantees for DataSets over certain types of encoded data.
> Thinking on how this might be resolved, first, if it's a legitimate bug,
> I'm not sure why "non-streaming file based" datasources need to have their
> StructFields all rendered nullable. Simply removing the call to asNullable
> would fix the issue. Second, if it's actually necessary for most
> filesystem-read data-sources to have their StructFields potentially
> nullable in this manner, we could instead let the StructType schema
> provided to the Encoder have the final say in the DataSet's schema.
> This latter option seems sensible to me: if a client is willing to provide
> a custom Encoder via the .as(..) method on the reader, presumably in
> setting the schema field of the encoder they have some legitimate notion of
> how their object's types should be mapped to DataSet column types. Any
> failure when resolving their data to a DataSet by means of their Encoder
> can then be traced to their Encoder for their own debugging.
> Thoughts? Thanks,
> Alek Eskilson
> [1] -
> [2] -
> [3] -
> [4] -

View raw message