spark-user mailing list archives

From Felix Kizhakkel Jose <felixkizhakkelj...@gmail.com>
Subject How to modify a field in a nested struct using pyspark
Date Fri, 29 Jan 2021 16:30:53 GMT
Hello All,

I am using PySpark Structured Streaming, and my timestamp fields arrive
as plain longs (epoch milliseconds), so I have to convert these fields to a
timestamp type.

A sample JSON object:

{
  "id": {
    "value": "f40b2e22-4003-4d90-afd3-557bc013b05e",
    "type": "UUID",
    "system": "Test"
  },
  "status": "Active",
  "timingPeriod": {
    "startDateTime": 1611859271516,
    "endDateTime": null
  },
  "eventDateTime": 1611859272122,
  "isPrimary": true
}

Here I want to convert "eventDateTime", "startDateTime", and
"endDateTime" to timestamp types.
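For reference, the intended conversion on the sample values can be sketched in plain Python, outside Spark. The helper name millis_to_utc is hypothetical, and the sketch assumes the longs are milliseconds since the Unix epoch in UTC:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Unix epoch as a timezone-aware datetime (assumption: the longs are UTC epoch millis).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_utc(ms: Optional[int]) -> Optional[datetime]:
    """Convert epoch milliseconds to a UTC datetime; None stays None."""
    if ms is None:
        return None
    # Integer timedelta arithmetic avoids floating-point rounding of the millis.
    return EPOCH + timedelta(milliseconds=ms)


start = millis_to_utc(1611859271516)  # timingPeriod.startDateTime from the sample
end = millis_to_utc(None)             # timingPeriod.endDateTime is null in the sample
event = millis_to_utc(1611859272122)  # eventDateTime from the sample
```

A null input stays null, which is the behaviour wanted for "endDateTime".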

So I have done the following:

from pyspark.sql import functions as f


def transform_date_col(date_col):
    # Convert epoch milliseconds to seconds; null values stay null.
    return f.when(f.col(date_col).isNotNull(), f.col(date_col) / 1000)


df = (
    df.withColumn(
        "eventDateTime", transform_date_col("eventDateTime").cast("timestamp"))
    .withColumn(
        "timingPeriod.start", transform_date_col("timingPeriod.start").cast("timestamp"))
    .withColumn(
        "timingPeriod.end", transform_date_col("timingPeriod.end").cast("timestamp"))
)

After this, the timingPeriod fields are no longer part of a struct; instead
they become two separate fields named "timingPeriod.start" and
"timingPeriod.end".

How can I keep them inside a struct as before?
Is there a generic way to modify one or more fields of nested structs?
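One possible direction, offered as a rough sketch rather than a confirmed solution, is to walk the schema and rebuild each struct with Spark SQL's named_struct, casting the selected long fields on the way. The version below works on the dictionary returned by df.schema.jsonValue(), so it can be tried without a Spark session. The function names build_expr and select_exprs and the rule "name ends with DateTime" are assumptions made for illustration. Arrays and maps are passed through unchanged, field names are assumed not to contain quotes or backticks, and a null struct would come back as a struct of nulls.

```python
from typing import Callable, List, Union

SchemaType = Union[str, dict]


def build_expr(path: str, name: str, dtype: SchemaType,
               is_millis: Callable[[str], bool]) -> str:
    """Return a Spark SQL expression string for one (possibly nested) field."""
    if isinstance(dtype, dict) and dtype.get("type") == "struct":
        # Rebuild the struct field by field so it stays a struct.
        parts = []
        for field in dtype["fields"]:
            child_path = f"{path}.`{field['name']}`"
            child = build_expr(child_path, field["name"], field["type"], is_millis)
            parts.append(f"'{field['name']}', {child}")
        return f"named_struct({', '.join(parts)})"
    if dtype == "long" and is_millis(name):
        # Milliseconds to seconds, then cast; a null input stays null.
        return f"CAST({path} / 1000 AS TIMESTAMP)"
    return path  # every other type is passed through unchanged


def select_exprs(schema_json: dict, is_millis: Callable[[str], bool]) -> List[str]:
    """Build one select expression per top-level column."""
    exprs = []
    for field in schema_json["fields"]:
        expr = build_expr(f"`{field['name']}`", field["name"], field["type"], is_millis)
        exprs.append(f"{expr} AS `{field['name']}`")
    return exprs


# Hand-written schema in the shape of df.schema.jsonValue(), trimmed to three columns.
sample_schema = {
    "type": "struct",
    "fields": [
        {"name": "status", "type": "string", "nullable": True, "metadata": {}},
        {"name": "timingPeriod", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [
                {"name": "startDateTime", "type": "long", "nullable": True, "metadata": {}},
                {"name": "endDateTime", "type": "long", "nullable": True, "metadata": {}},
            ]}},
        {"name": "eventDateTime", "type": "long", "nullable": True, "metadata": {}},
    ],
}

# Hypothetical naming rule: treat long fields whose name ends in "DateTime" as millis.
exprs = select_exprs(sample_schema, lambda n: n.endswith("DateTime"))

# With a real DataFrame this would be used roughly as:
#   df = df.selectExpr(*select_exprs(df.schema.jsonValue(), rule))
```

On Spark 3.1 or later, Column.withField may be a simpler option when only a few known nested fields need replacing.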

I have hundreds of entities where long fields need to be converted to
timestamps, so a generic implementation would help my data ingestion
pipeline a lot.

Regards,
Felix K Jose
