spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Merge multiple different s3 logs using pyspark 2.4.3
Date Thu, 09 Jan 2020 15:23:32 GMT
Hi Shraddha,

What is interesting to me is that people do not even have the courtesy to
write their name when they ask user groups for help :)

Your solution is spot on. There is also another option for this available
in Spark SQL.
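
For illustration, here is a minimal PySpark sketch of one Spark-side way to do this. It is not necessarily the option meant above. It reads each source separately, tags it with a literal "src_category" column, unions the three DataFrames, and writes the result partitioned by that column. Only the source1 -> "other" pairing is implied by the thread; the other two pairings, and the helper names input_path, merge_sources and write_processed, are assumptions made up for this example.

```python
from functools import reduce

BASE = "s3a://my-bucket/ingestion"

# Assumption: only source1 -> "other" is implied by the thread;
# the other two pairings are placeholders to adjust.
SOURCE_TO_CATEGORY = {
    "source1": "other",
    "source2": "windows-new",
    "source3": "windows",
}


def input_path(source, y, m, d):
    """Build the raw-data prefix for one source and one day."""
    return f"{BASE}/{source}/y={y}/m={m}/d={d}"


def merge_sources(spark, y, m, d):
    """Read each source, tag it with src_category, and union the results."""
    # Imported lazily so the path helper can be used without Spark installed.
    from pyspark.sql import functions as F

    frames = [
        spark.read.json(input_path(source, y, m, d))
        .withColumn("src_category", F.lit(category))
        for source, category in SOURCE_TO_CATEGORY.items()
    ]
    # unionByName matches columns by name rather than by position.
    return reduce(lambda left, right: left.unionByName(right), frames)


def write_processed(df, y, m, d):
    """Write one sub-directory per src_category value under the day prefix."""
    out = f"{BASE}/processed/y={y}/m={m}/d={d}"
    # "overwrite" replaces any existing output under this day's prefix.
    df.write.mode("overwrite").partitionBy("src_category").json(out)
```

With an existing SparkSession named spark, this would be called as write_processed(merge_sources(spark, "2019", "12", "12"), "2019", "12", "12"). partitionBy produces directories such as .../d=12/src_category=other, which matches the output layout in the original question.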


Regards,
Gourav Sengupta

On Thu, Jan 9, 2020 at 1:19 PM Shraddha Shah <shah.shraddha.18@gmail.com>
wrote:

> Unless I am reading this wrong, this can be achieved with aws s3 sync?
>
> aws s3 sync \
>   s3://my-bucket/ingestion/source1/y=2019/m=12/d=12 \
>   s3://my-bucket/ingestion/processed/src_category=other/y=2019/m=12/d=12
>
> Thanks,
> -Shraddha
>
>
>
> On Thu, Jan 9, 2020 at 7:05 AM Gourav Sengupta <gourav.sengupta@gmail.com>
> wrote:
>
>> why s3a?
>>
>> On Thu, Jan 9, 2020 at 2:20 AM anbutech <anbutech17@outlook.com> wrote:
>>
>>> Hello,
>>>
>>> version = spark 2.4.3
>>>
>>> I have JSON log data from 3 different sources, all with the same schema
>>> (same column order) in the raw data. I want to add one new column,
>>> "src_category", to each of the 3 sources to distinguish the source
>>> category, and then merge all 3 sources into a single DataFrame
>>> for processing. What is the best way to handle this case?
>>>
>>> df = spark.read.json(merged_3sourcesraw_data)
>>>
>>> Input:
>>>
>>> s3a://my-bucket/ingestion/source1/y=2019/m=12/d=12/logs1.json
>>> s3a://my-bucket/ingestion/source2/y=2019/m=12/d=12/logs1.json
>>> s3a://my-bucket/ingestion/source3/y=2019/m=12/d=12/logs1.json
>>>
>>> Output:
>>>
>>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=other
>>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows-new
>>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows
>>>
>>>
>>> Thanks
