spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Merge multiple different s3 logs using pyspark 2.4.3
Date Thu, 09 Jan 2020 12:05:10 GMT
why s3a?

On Thu, Jan 9, 2020 at 2:20 AM anbutech <anbutech17@outlook.com> wrote:

> Hello,
>
> version = spark 2.4.3
>
> I have three different sources of JSON logs with the same schema (same
> column order) in the raw data. I want to add a new column, "src_category",
> to each of the three sources to distinguish them, and then merge all
> three sources into a single DataFrame so the JSON data can be read for
> processing. What is the best way to handle this case?
>
> df = spark.read.json(merged_3sourcesraw_data)
>
> Input:
>
> s3a://my-bucket/ingestion/source1/y=2019/m=12/d=12/logs1.json
> s3a://my-bucket/ingestion/source2/y=2019/m=12/d=12/logs1.json
> s3a://my-bucket/ingestion/source3/y=2019/m=12/d=12/logs1.json
>
> output:
> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=other
>
> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows-new
> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows
>
>
> Thanks
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>
