spark-user mailing list archives

From Saurabh Gulati <saurabh.gul...@fedex.com.INVALID>
Subject Re: [EXTERNAL] [Marketing Mail] Reading SPARK 3.1.x generated parquet in SPARK 2.4.x
Date Thu, 12 Aug 2021 15:14:51 GMT
We had issues with this migration, mainly because of the change in Spark's date calendar (Spark 3 uses the proleptic Gregorian calendar). See <https://www.waitingforcode.com/apache-spark-sql/whats-new-apache-spark-3-proleptic-calendar-date-time-management/read>
We got this working by setting the following parameters:

("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY"),
("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED"),
("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY"),
("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
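
For anyone applying these, here is a minimal PySpark sketch of one way to pass the four settings above to a session builder. It assumes a Spark 3.1.x installation. The names `REBASE_CONF`, `apply_rebase_conf` and `build_session`, and the app name, are illustrative and not from our actual job.

```python
# The four rebase settings listed above, as key/value pairs.
REBASE_CONF = {
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "LEGACY",
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "CORRECTED",
    "spark.sql.legacy.parquet.int96RebaseModeInRead": "LEGACY",
    "spark.sql.legacy.parquet.int96RebaseModeInWrite": "CORRECTED",
}


def apply_rebase_conf(builder, conf=REBASE_CONF):
    """Call builder.config(key, value) for every entry and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


def build_session(app_name="rebase-example"):
    """Create a SparkSession with the rebase settings applied.

    Assumption: pyspark 3.1.x is installed; the import is kept local so the
    rest of this snippet can be loaded without it.
    """
    from pyspark.sql import SparkSession

    return apply_rebase_conf(SparkSession.builder.appName(app_name)).getOrCreate()
```

The same pairs can also be passed on the command line as `--conf key=value` arguments to `spark-submit`.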


But otherwise, it's a change for the better. Performance seems better.
Also, there were bugs in 3.0.1 which have been addressed in 3.1.1.
________________________________
From: Gourav Sengupta <gourav.sengupta.developer@gmail.com>
Sent: 05 August 2021 10:17
To: user @spark <user@spark.apache.org>
Subject: [EXTERNAL] [Marketing Mail] Reading SPARK 3.1.x generated parquet in SPARK 2.4.x

Hi,

we are trying to migrate some of the data lake pipelines to run in SPARK 3.x, whereas the
dependent pipelines using those tables will still be running in SPARK 2.4.x for some time to
come.

Does anyone know of any issues that can happen:
1. when reading Parquet files written in 3.1.x in SPARK 2.4
2. when in the data lake some partitions have parquet files written in SPARK 2.4.x and some
are in SPARK 3.1.x.

Please note that there are no changes in schema, but later on we might end up adding or removing
some columns.

I will be really grateful for your kind help on this.

Regards,
Gourav Sengupta
