spark-user mailing list archives

From Felix Kizhakkel Jose <felixkizhakkelj...@gmail.com>
Subject Re: How to modify a field in a nested struct using pyspark
Date Sat, 30 Jan 2021 02:11:28 GMT
Thank you so much for the quick response and great help.

@Jeff, I will use the library if the 3.1 release gets delayed. Thank
you so much.

On Fri, Jan 29, 2021 at 1:23 PM Jeff Evans <jeffrey.wayne.evans@gmail.com>
wrote:

> If you need to do this in 2.x, this library does the trick:
> https://github.com/fqaiser94/mse
>
> On Fri, Jan 29, 2021 at 12:15 PM Adam Binford <adamq43@gmail.com> wrote:
>
>> I think they're voting on the next release candidate starting sometime
>> next week. So hopefully, barring any other major hurdles, it will land
>> within the next few weeks.
>>
>> On Fri, Jan 29, 2021, 1:01 PM Felix Kizhakkel Jose <
>> felixkizhakkeljose@gmail.com> wrote:
>>
>>> Wow, that's really great to know. Thank you so much Adam. Do you know
>>> when the 3.1 release is scheduled?
>>>
>>> Regards,
>>> Felix K Jose
>>>
>>> On Fri, Jan 29, 2021 at 12:35 PM Adam Binford <adamq43@gmail.com> wrote:
>>>
>>>> As of 3.0, the only way to do it is something that recreates the
>>>> whole struct:
>>>> df.withColumn(
>>>>     'timingPeriod',
>>>>     f.struct(
>>>>         f.col('timingPeriod.start').cast('timestamp').alias('start'),
>>>>         f.col('timingPeriod.end').cast('timestamp').alias('end')))
>>>>
>>>> There's a new method coming in 3.1 on the Column class called withField
>>>> which was designed for this purpose. I backported it to my personal 3.0
>>>> build because of how useful it is. It works something like:
>>>> df.withColumn(
>>>>     'timingPeriod',
>>>>     f.col('timingPeriod')
>>>>         .withField('start', f.col('timingPeriod.start').cast('timestamp'))
>>>>         .withField('end', f.col('timingPeriod.end').cast('timestamp')))
>>>>
>>>> And it works on multiple levels of nesting which is nice.
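The contract withField gives you on a struct column can be illustrated outside Spark with a plain-Python sketch (the set_field helper below is hypothetical, not part of any library): return a copy of a nested record with one dotted-path field replaced, leaving all sibling fields untouched.

```python
from datetime import datetime, timezone

def set_field(record, path, fn):
    """Return a copy of a nested dict with the value at dotted `path`
    replaced by fn(old_value); sibling fields are left untouched."""
    head, _, rest = path.partition(".")
    out = dict(record)  # shallow copy of this level only
    out[head] = set_field(record[head], rest, fn) if rest else fn(record[head])
    return out

row = {"timingPeriod": {"startDateTime": 1611859271516, "endDateTime": None}}

# convert epoch milliseconds to a timezone-aware datetime, passing None through
to_ts = lambda ms: None if ms is None else datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

fixed = set_field(row, "timingPeriod.startDateTime", to_ts)
```

Like withField, this rewrites one leaf without the caller having to re-list every other field of the struct by hand.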
>>>>
>>>> On Fri, Jan 29, 2021 at 11:32 AM Felix Kizhakkel Jose <
>>>> felixkizhakkeljose@gmail.com> wrote:
>>>>
>>>>> Hello All,
>>>>>
>>>>> I am using pyspark structured streaming and I am getting timestamp
>>>>> fields as plain longs (milliseconds), so I have to modify these fields
>>>>> into a timestamp type.
>>>>>
>>>>> A sample JSON object:
>>>>>
>>>>> {
>>>>>   "id":{
>>>>>       "value": "f40b2e22-4003-4d90-afd3-557bc013b05e",
>>>>>       "type": "UUID",
>>>>>       "system": "Test"
>>>>>     },
>>>>>   "status": "Active",
>>>>>   "timingPeriod": {
>>>>>     "startDateTime": 1611859271516,
>>>>>     "endDateTime": null
>>>>>   },
>>>>>   "eventDateTime": 1611859272122,
>>>>>   "isPrimary": true
>>>>> }
>>>>>
>>>>> Here I want to convert "eventDateTime", "startDateTime", and
>>>>> "endDateTime" to timestamp types.
>>>>>
>>>>> So I have done the following:
>>>>>
>>>>> def transform_date_col(date_col):
>>>>>     return f.when(f.col(date_col).isNotNull(), f.col(date_col) / 1000)
>>>>>
>>>>> df.withColumn(
>>>>>     "eventDateTime",
>>>>>     transform_date_col("eventDateTime").cast("timestamp")
>>>>> ).withColumn(
>>>>>     "timingPeriod.start",
>>>>>     transform_date_col("timingPeriod.start").cast("timestamp")
>>>>> ).withColumn(
>>>>>     "timingPeriod.end",
>>>>>     transform_date_col("timingPeriod.end").cast("timestamp")
>>>>> )
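As a sanity check on the arithmetic above (plain Python, no Spark needed): dividing the millisecond value by 1000 yields fractional epoch seconds, which is exactly what a timestamp interpretation of the number expects.

```python
from datetime import datetime, timezone

event_ms = 1611859272122          # "eventDateTime" from the sample JSON
event_s = event_ms / 1000         # 1611859272.122 epoch seconds
ts = datetime.fromtimestamp(event_s, tz=timezone.utc)
print(ts.isoformat())             # a real timestamp, not a plain long
```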
>>>>>
>>>>> But the timingPeriod fields are not a struct anymore; rather, they become
>>>>> two new top-level fields named "timingPeriod.start" and "timingPeriod.end".
>>>>>
>>>>> How can I get them as a struct as before?
>>>>> Is there a generic way I can modify a single/multiple properties of
>>>>> nested structs?
>>>>>
>>>>> I have hundreds of entities where these longs need to be converted to
>>>>> timestamps, so a generic implementation would help my data ingestion
>>>>> pipeline a lot.
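One way to sketch the generic idea outside Spark (a plain-Python walk over parsed JSON; the convert_millis helper and the endswith("DateTime") naming rule are assumptions for illustration, not anything from the Spark API): recurse through nested dicts and rewrite every millisecond field, which is essentially what a schema-driven rebuild of nested struct columns does in pyspark.

```python
from datetime import datetime, timezone

def convert_millis(obj):
    """Recursively walk parsed JSON and turn every *DateTime field
    holding epoch milliseconds into an ISO-8601 timestamp string."""
    if isinstance(obj, dict):
        return {
            k: (datetime.fromtimestamp(v / 1000, tz=timezone.utc).isoformat()
                if k.endswith("DateTime") and isinstance(v, int)
                else convert_millis(v))
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [convert_millis(x) for x in obj]
    return obj  # scalars (including None) pass through unchanged

record = {
    "timingPeriod": {"startDateTime": 1611859271516, "endDateTime": None},
    "eventDateTime": 1611859272122,
    "isPrimary": True,
}
out = convert_millis(record)
```

Nulls survive (endDateTime stays None), nesting is handled by the recursion, and adding a new entity costs nothing as long as it follows the same naming convention.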
>>>>>
>>>>> Regards,
>>>>> Felix K Jose
>>>>>
>>>>>
>>>>
>>>> --
>>>> Adam Binford
>>>>
>>>
