spark-user mailing list archives

From Praneeth Gayam <praneeth.ga...@gmail.com>
Subject Re: Spark Dataframe returning null columns when schema is specified
Date Fri, 08 Sep 2017 06:34:26 GMT
What is the desired behaviour when a field is null for only a few records?
You cannot avoid nulls in that case.
But if all rows are guaranteed to be uniform (each column is either all-null or
all-non-null), you can *take* the first row of the DF and *drop* the
columns whose fields are null in that row.
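
A minimal sketch of that idea, assuming the uniformity guarantee above holds. The class and method names (NullColumnPruner, nullColumns) are hypothetical, and the column-selection logic is written in plain Java so it runs without a cluster. The Spark calls named in the comments (df.first(), df.columns(), Row.isNullAt, df.drop) are only indicated, not exercised here.

```java
import java.util.ArrayList;
import java.util.List;

public class NullColumnPruner {

    /**
     * Returns the names of the columns whose value in the sampled row is null.
     * names[i] is the column name for values[i].
     *
     * With Spark's DataFrame API (not exercised here) the inputs would come from
     *   Row first = df.first();
     *   String[] names = df.columns();
     *   values[i] = first.isNullAt(i) ? null : first.get(i);
     * and each returned name would then be removed with df = df.drop(name).
     */
    public static List<String> nullColumns(String[] names, Object[] values) {
        if (names.length != values.length) {
            throw new IllegalArgumentException("names and values must have the same length");
        }
        List<String> toDrop = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            if (values[i] == null) {
                toDrop.add(names[i]);
            }
        }
        return toDrop;
    }

    public static void main(String[] args) {
        // A shortened version of the sample record: "owner" is absent from the JSON,
        // so it is null once the schema is applied.
        String[] names = {"request_id", "org_id", "queue_id", "owner", "disposition"};
        Object[] firstRow = {"f791e831f71e4918b2fcaebfdf6fe2c2", "y08e7p9g", 1234, null, "O"};
        System.out.println(nullColumns(names, firstRow)); // prints [owner]
    }
}
```

This only inspects a single row, so it is safe only when every row in the batch has the same set of null columns. If that does not hold, columns that are populated in later rows would be dropped as well.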

On Fri, Sep 8, 2017 at 12:14 AM, ravi6c2 <ravi.6c2@gmail.com> wrote:

> Hi All, I have a problem where a Spark DataFrame has null columns
> for attributes that are not present in the JSON. A detailed explanation is
> provided below:
>
> *Use case:* Convert the JSON object into a DataFrame for further use.
>
> *Case - 1:* Without specifying the schema for JSON:
>
> records.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
>     private static final long serialVersionUID = 1L;
>
>     @Override
>     public void call(JavaRDD<String> rdd, Time time) throws Exception {
>         if (rdd.count() > 0) {
>             JavaRDD<String> filteredRDD = rdd.filter(x -> x.length() > 0);
>             sqlContext = SQLContextSingleton.getInstance(filteredRDD.context());
>             DataFrame df = sqlContext.read().json(filteredRDD);
>             df.show();
>         }
>     }
> });
>
> In the above code sample, filteredRDD contains JSON objects as strings.
>
> *Sample JSON Record:*
> {"request_id":"f791e831f71e4918b2fcaebfdf6fe2c2","org_id":"y08e7p9g",
> "queue_id":1234,"disposition":"O","created":"2017-06-02 23:49:10.410",
> "assigned":"2017-06-02 23:49:10.410","final_review_status":"cancel",
> "datetime":"2017-06-02 23:49:10.410"}
>
> *Dataframe Output:*
>
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/t8407/Screenshot_at_Sep_07_11-36-27.png>
>
> *Case - 2:* Specifying the schema for JSON:
>
> records.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
>     private static final long serialVersionUID = 1L;
>
>     @Override
>     public void call(JavaRDD<String> rdd, Time time) throws Exception {
>         if (rdd.count() > 0) {
>             JavaRDD<String> filteredRDD = rdd.filter(x -> x.length() > 0);
>             sqlContext = SQLContextSingleton.getInstance(filteredRDD.context());
>             DataFrame df = sqlContext.read().schema(SchemaBuilder.buildSchema()).json(filteredRDD);
>             df.show();
>         }
>     }
> });
>
> In the above code sample, filteredRDD contains JSON objects as strings.
>
> *Schema Definition:*
> public static StructType buildSchema() {
>     StructType schema = new StructType(new StructField[] {
>         DataTypes.createStructField("request_id", DataTypes.StringType, false),
>         DataTypes.createStructField("org_id", DataTypes.StringType, false),
>         DataTypes.createStructField("queue_id", DataTypes.IntegerType, true),
>         DataTypes.createStructField("owner", DataTypes.StringType, true),
>         DataTypes.createStructField("disposition", DataTypes.StringType, true),
>         DataTypes.createStructField("created", DataTypes.TimestampType, true),
>         DataTypes.createStructField("created_user", DataTypes.StringType, true),
>         DataTypes.createStructField("assigned", DataTypes.TimestampType, true),
>         DataTypes.createStructField("assigned_user", DataTypes.StringType, true),
>         DataTypes.createStructField("notes", DataTypes.StringType, true),
>         DataTypes.createStructField("final_review_status", DataTypes.StringType, true),
>         DataTypes.createStructField("event_tag", DataTypes.StringType, true),
>         DataTypes.createStructField("additional_data", DataTypes.StringType, true),
>         DataTypes.createStructField("datetime", DataTypes.TimestampType, true),
>         DataTypes.createStructField("dc", DataTypes.StringType, true),
>         DataTypes.createStructField("case_id", DataTypes.StringType, true),
>         DataTypes.createStructField("case_status", DataTypes.StringType, true)
>     });
>     return schema;
> }
>
> *Sample JSON Record:*
> {"request_id":"f791e831f71e4918b2fcaebfdf6fe2c2","org_id":"y08e7p9g",
> "queue_id":1234,"disposition":"O","created":"2017-06-02 23:49:10.410",
> "assigned":"2017-06-02 23:49:10.410","final_review_status":"cancel",
> "datetime":"2017-06-02 23:49:10.410"}
>
> *Dataframe Output:*
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/t8407/sample.png>
>
> As you can see in the above case, when the schema is defined, the columns
> that are not present in the JSON come back as null. Is there any workaround
> to get the result shown in the first image (without nulls) while still using
> a schema? I need this to perform updates into a Kudu table. Because the other
> columns are assigned NULL, they are written to Kudu as null, which
> is not desired.
>
> Thanks,
> Ravi