Hi Shraddha,

What is interesting to me is that people do not even have the courtesy to sign their name when they request help from user groups :)

Your solution is spot on, although there is another option available in Spark SQL for this.
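Gourav does not say which Spark SQL option he has in mind; one candidate (a sketch of my own, not necessarily what he meant) is to read all three source prefixes in a single pass and derive src_category from the file path with input_file_name() and regexp_extract(), both standard Spark SQL functions. The bucket name and category mapping below just mirror the paths in the question.

```python
# Sketch: tag rows by deriving src_category from the input path,
# avoiding three separate reads. The CASE mapping is illustrative.
# Spark calls live inside main() so the helper is importable without
# a running Spark session.

def category_sql_expr():
    """Spark SQL CASE expression mapping the source directory to src_category."""
    return """
        CASE regexp_extract(input_file_name(), 'ingestion/([^/]+)/', 1)
            WHEN 'source1' THEN 'other'
            WHEN 'source2' THEN 'windows-new'
            WHEN 'source3' THEN 'windows'
        END
    """

def main():
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.appName("tag-by-path").getOrCreate()
    df = (spark.read
               .json("s3a://my-bucket/ingestion/source*/y=2019/m=12/d=12/")
               .withColumn("src_category", expr(category_sql_expr())))
    df.show()

if __name__ == "__main__":
    main()
```

The glob in the read path pulls in all three sources at once; input_file_name() is evaluated per row, so each row is tagged with the category of the file it came from.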


Regards,
Gourav Sengupta

On Thu, Jan 9, 2020 at 1:19 PM Shraddha Shah <shah.shraddha.18@gmail.com> wrote:
Unless I am reading this wrong, this can be achieved with aws s3 sync:

aws s3 sync s3://my-bucket/ingestion/source1/y=2019/m=12/d=12 s3://my-bucket/ingestion/processed/src_category=other/y=2019/m=12/d=12

Thanks,
-Shraddha



On Thu, Jan 9, 2020 at 7:05 AM Gourav Sengupta <gourav.sengupta@gmail.com> wrote:
why s3a?

On Thu, Jan 9, 2020 at 2:20 AM anbutech <anbutech17@outlook.com> wrote:
Hello,

version = spark 2.4.3

I have 3 different sources of JSON log data which have the same schema (same
column order) in the raw data. I want to add one new column, "src_category",
to each of the 3 sources to distinguish the source category, and merge the
3 sources into a single dataframe for processing. What is the best way to
handle this case?

df = spark.read.json(merged_3sourcesraw_data)

Input:

s3a://my-bucket/ingestion/source1/y=2019/m=12/d=12/logs1.json
s3a://my-bucket/ingestion/source2/y=2019/m=12/d=12/logs1.json
s3a://my-bucket/ingestion/source3/y=2019/m=12/d=12/logs1.json

Output:
s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=other
s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows-new
s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows
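One straightforward way to get from the input layout to the output layout above (a sketch only; the bucket paths and the source-to-category mapping are taken from the listing, but I am assuming each source maps to one fixed category) is to read each source separately, tag it with a literal column, union the frames, and write partitioned by src_category:

```python
# Sketch: tag each source with src_category and merge into one frame.
# Paths and categories mirror the question; adjust to your layout.
# Spark calls are kept inside main() so the helpers are importable
# without a running Spark session.

SOURCE_CATEGORIES = {
    "source1": "other",
    "source2": "windows-new",
    "source3": "windows",
}

def source_path(bucket, source, y, m, d):
    """Build the raw-data path for one source/day partition."""
    return f"s3a://{bucket}/ingestion/{source}/y={y}/m={m}/d={d}/"

def main():
    from functools import reduce
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.appName("merge-sources").getOrCreate()

    frames = [
        spark.read.json(source_path("my-bucket", src, 2019, 12, 12))
             .withColumn("src_category", lit(cat))
        for src, cat in SOURCE_CATEGORIES.items()
    ]
    # All three sources share a schema, so a plain union is safe.
    df = reduce(lambda a, b: a.union(b), frames)

    (df.write
       .mode("overwrite")
       .partitionBy("src_category")
       .json("s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/"))

if __name__ == "__main__":
    main()
```

Because partitionBy("src_category") appends src_category=... directories under the target path, the output lands in exactly the .../y=2019/m=12/d=12/src_category=... layout shown above.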


Thanks




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org