spark-user mailing list archives

From Devender Yadav
Subject Add column value in the dataset on the basis of a condition
Date Tue, 18 Dec 2018 13:47:56 GMT
Hi All,

Relevant bean class:

public class EmployeeBean implements Serializable {

    private Long id;

    private String name;

    private Long salary;

    private Integer age;

    // getters and setters
}

Relevant spark code:

// Assumes: import static org.apache.spark.sql.functions.lit;
SparkSession spark = SparkSession.builder().master("local[2]").appName("play-with-spark").getOrCreate();
List<EmployeeBean> employees1 = populateEmployees(1, 10);

Dataset<EmployeeBean> ds1 = spark.createDataset(employees1, Encoders.bean(EmployeeBean.class));

// Split on whether age is null, tag each half with a constant flag, then recombine.
Dataset<Row> ds2 = ds1.where("age is null").withColumn("is_age_null", lit(true));
Dataset<Row> ds3 = ds1.where("age is not null").withColumn("is_age_null", lit(false));

Dataset<Row> ds4 = ds2.union(ds3);

Relevant Output:

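ds1:

+----+---+----+------+
| age| id|name|salary|
+----+---+----+------+
|null|  1|dev1| 11000|
|   2|  2|dev2| 12000|
|null|  3|dev3| 13000|
|   4|  4|dev4| 14000|
|null|  5|dev5| 15000|
+----+---+----+------+

ds4:

+----+---+----+------+-----------+
| age| id|name|salary|is_age_null|
+----+---+----+------+-----------+
|null|  1|dev1| 11000|       true|
|null|  3|dev3| 13000|       true|
|null|  5|dev5| 15000|       true|
|   2|  2|dev2| 12000|      false|
|   4|  4|dev4| 14000|      false|
+----+---+----+------+-----------+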

Is there a better way to add this column to the dataset than creating two datasets
and performing a union?
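
For illustration only, one possible single-step approach is to derive the flag
directly from the column with Column.isNull(), which yields a boolean column.
This is a minimal sketch, not something taken from the thread: the class name
IsAgeNullExample and the hand-built rows are hypothetical stand-ins for ds1, and
it assumes Spark SQL is on the classpath.

```java
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical example class; not part of the original post.
public class IsAgeNullExample {

    // Adds a boolean column that is true exactly where "age" is null.
    static Dataset<Row> addIsAgeNull(Dataset<Row> ds) {
        return ds.withColumn("is_age_null", col("age").isNull());
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("play-with-spark")
                .getOrCreate();

        // Schema mirroring the EmployeeBean fields from the post.
        StructType schema = new StructType()
                .add("age", DataTypes.IntegerType, true)
                .add("id", DataTypes.LongType, false)
                .add("name", DataTypes.StringType, true)
                .add("salary", DataTypes.LongType, true);

        // Hand-built sample rows standing in for populateEmployees(...).
        List<Row> rows = Arrays.asList(
                RowFactory.create(null, 1L, "dev1", 11000L),
                RowFactory.create(2, 2L, "dev2", 12000L),
                RowFactory.create(null, 3L, "dev3", 13000L),
                RowFactory.create(4, 4L, "dev4", 14000L));

        Dataset<Row> ds1 = spark.createDataFrame(rows, schema);

        // One withColumn call replaces the two filters and the union.
        Dataset<Row> ds4 = addIsAgeNull(ds1);
        ds4.show();

        spark.stop();
    }
}
```

With the typed Dataset<EmployeeBean> from the code above, the same call
ds1.withColumn("is_age_null", col("age").isNull()) should also return a
Dataset<Row>, since withColumn is an untyped transformation.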



