spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Flatten log data Using Pyspark
Date Tue, 03 Dec 2019 06:01:03 GMT
Why do you want to use a UDF?

Regards,
Gourav

On Sat, Nov 30, 2019 at 3:06 AM anbutech <anbutech17@outlook.com> wrote:

> Hi,
>
> I have a raw source dataframe with two columns, as below:
>
> timestamp
> 2019-11-29 9:30:45
>
> message_log
>
> <123>NOV 29 10:20:35 ips01 sfids: connection:
> tcp,bytes:104,user:unknown,url:unknown,host:127.0.0.1
>
> How do we break each of the above key-value pairs out into separate
> columns using a UDF in PySpark?
>
> What is the right approach for flattening this type of log data: regex or
> plain Python logic?
>
> Could you please help me with the logic for flattening the log data?
>
> The final output dataframe should have the following columns and values:
>
> timestamp: 2019-11-29 9:30:45
> prio: 123
> msg_ts: NOV 29 10:20:35
> msg_ids: ips01
> sfids:
> connection: tcp
> bytes: 104
> user: unknown
> url: unknown
> host: 127.0.0.1
>
>
> Thanks
> Anbu
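A minimal sketch of the no-UDF direction the reply hints at: a single regex for the fixed header plus a simple split for the comma-separated body, run here in plain Python against the sample line from the question. The group names (`prio`, `msg_ts`, `msg_ids`) follow the desired output above; `proc` is my own label for the process-name field, which the question calls `sfids`.

```python
import re

# Hypothetical sample built from the message_log line in the question.
log = ("<123>NOV 29 10:20:35 ips01 sfids: connection: "
       "tcp,bytes:104,user:unknown,url:unknown,host:127.0.0.1")

# Header: <prio>, syslog timestamp, host id, process name, then the
# comma-separated key:value body.
header_re = re.compile(
    r"^<(?P<prio>\d+)>"
    r"(?P<msg_ts>\w{3} \d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<msg_ids>\S+) "
    r"(?P<proc>\w+): "
    r"(?P<body>.*)$"
)

fields = header_re.match(log).groupdict()
body = fields.pop("body")

# The body is comma-separated key:value pairs, so a split is enough;
# strip() absorbs the stray space in "connection: tcp".
for pair in body.split(","):
    key, _, value = pair.partition(":")
    fields[key.strip()] = value.strip()

print(fields)
```

On the Spark side the same pattern can drive `pyspark.sql.functions.regexp_extract` (one call per output column, with group syntax adapted to Java regexes, i.e. `(?<name>...)` instead of `(?P<name>...)`), so no Python UDF is required.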
