spark-user mailing list archives

From Jeff Evans <jeffrey.wayne.ev...@gmail.com>
Subject Distinguishing between field missing and null in individual record?
Date Tue, 25 Jun 2019 18:03:22 GMT
Suppose we have the following JSON, which we parse into a DataFrame
(using the multiLine option; a read sketch follows the snippet).

[{
  "id": 8541,
  "value": "8541 changed again value"
},{
  "id": 51109,
  "name": "newest bob",
  "value": "51109 changed again"
}]
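
For reference, the read is along these lines (the path and session setup
here are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("missing-vs-null").getOrCreate()

// multiLine is needed because each record spans several lines
val df = spark.read
  .option("multiLine", "true")
  .json("/path/to/records.json")

df.show()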

Regardless of whether we define a schema explicitly or allow it to be
inferred, the result of df.show() after parsing this data is similar to
the following (an explicit-schema sketch follows the output):

+-----+----------+--------------------+
|   id|      name|               value|
+-----+----------+--------------------+
| 8541|      null|8541 changed agai...|
|51109|newest bob| 51109 changed again|
+-----+----------+--------------------+
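
For the explicit-schema case, the schema is essentially the one inferred
above; a sketch, assuming id comes back as a long and the other columns
as strings:

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Explicit schema matching the columns shown above, passed instead of inferring
val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("value", StringType, nullable = true)
))

val dfExplicit = spark.read
  .option("multiLine", "true")
  .schema(schema)
  .json("/path/to/records.json")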

Notice that the name column for the first row is null.  The following
JSON produces an identical DataFrame (a quick comparison sketch follows
the snippet):

[{
  "id": 8541,
  "name": null,
  "value": "8541 changed again value"
},{
  "id": 51109,
  "name": "newest bob",
  "value": "51109 changed again"
}]
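
To illustrate, comparing the two reads (file names are illustrative)
shows that nothing at the DataFrame level distinguishes them:

val dfMissing = spark.read.option("multiLine", "true").json("/path/to/missing_name.json")
val dfNull    = spark.read.option("multiLine", "true").json("/path/to/explicit_null.json")

// Same schema and same rows in both cases
println(dfMissing.schema == dfNull.schema)                   // true
println(dfMissing.collect().toSet == dfNull.collect().toSet) // true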

Is there a way to distinguish between these two cases in the DataFrame
(i.e. a field that is missing and filled in as null by the inferred or
explicit schema, versus a field that is present but has a null value)?

