spark-user mailing list archives

From Stefan Panayotov <SPanayo...@msn.com>
Subject RE: we control spark file names before we write them - should we opensource it?
Date Mon, 08 Jun 2020 13:55:47 GMT
Yes, I think so.

Stefan Panayotov, PhD
spanayotov@outlook.com
spanayotov@comcast.net
spanayot59@gmail.com

-----Original Message-----
From: ilaimalka <ilai.malka@nielsen.com> 
Sent: Monday, June 8, 2020 9:17 AM
To: user@spark.apache.org
Subject: we control spark file names before we write them - should we opensource it?

Hi, as part of our work we needed more control over the names of the files written out by Spark.
For example, instead of "part-...csv.gz" we want something like "15988891_1748330679_20200507124153.tsv.gz",
where the first number is hardcoded, the second is the value from partitionBy, and the third
is a timestamp in a user-provided SimpleDateFormat pattern.
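
As an illustration of the naming scheme described above, here is a minimal Java sketch that assembles such a name. The class OutputFileNames, the method build, and its parameters are hypothetical and not part of the code discussed in this thread. The pattern "yyyyMMddHHmmss" and the UTC time zone are assumptions inferred from the sample name.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/** Hypothetical helper: builds "<prefix>_<partitionValue>_<timestamp><extension>". */
final class OutputFileNames {
    private OutputFileNames() {}

    static String build(String prefix, String partitionValue, Date timestamp,
                        String datePattern, String extension) {
        // SimpleDateFormat is not thread-safe, so create one per call.
        SimpleDateFormat fmt = new SimpleDateFormat(datePattern);
        // Assumption: timestamps are rendered in UTC so names do not depend on the host.
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return prefix + "_" + partitionValue + "_" + fmt.format(timestamp) + extension;
    }
}
```

Called with the prefix "15988891", the partition value "1748330679", a timestamp of 2020-05-07 12:41:53 UTC, the pattern "yyyyMMddHHmmss" and the extension ".tsv.gz", this yields the sample name from the message.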

After researching the options at length, we found that the most common approach is to locate those
files and rename them *after* the Spark job has finished. We tried to find a more efficient way.
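
For comparison, the rename-after-the-job approach might look roughly like the Java sketch below. It uses java.nio.file on a local directory only to stay self-contained; a job writing to HDFS or an object store would go through that store's own file system API instead. The class PartFileRenamer, the method renameParts, and the index suffix added when a directory holds several part files are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical post-job step: renames Spark's part-* files in one output directory. */
final class PartFileRenamer {
    private PartFileRenamer() {}

    static List<Path> renameParts(Path dir, String baseName, String extension)
            throws IOException {
        List<Path> parts = new ArrayList<>();
        // Collect every file whose name starts with "part-".
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts); // deterministic order for the index suffix

        List<Path> renamed = new ArrayList<>();
        for (int i = 0; i < parts.size(); i++) {
            // Assumption: append an index only when there is more than one part file.
            String suffix = parts.size() == 1 ? "" : "_" + i;
            Path target = dir.resolve(baseName + suffix + extension);
            renamed.add(Files.move(parts.get(i), target));
        }
        return renamed;
    }
}
```

The cost of this approach is an extra listing and one rename per file after the job ends, which is the overhead the message sets out to avoid.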

We decided to implement a new DataSource that wraps the most common standard Spark
file formats (csv, json, text, parquet, avro) and lets us choose the file name before the
file is written.

In short, this is how it works:
The DataSource extends FileFormat and implements prepareWrite, which redirects to a local
FileNameOutputWriterFactory TypeFactory. That in turn redirects to the original Spark formats'
FileNameOutputWriterFactory, which actually does the work and, via reflection, can call any
implementation to control the file name.
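
The delegation idea can be sketched in plain Java without any Spark dependency. Everything below is hypothetical: WriterFactory is only a stand-in for the writer factory that a format's prepareWrite returns, and FileNamer, RenamingWriterFactory and TsvNamer are illustrative names, not the classes from the implementation described in this message.

```java
/** Hypothetical hook: maps the path Spark would have used to the path we want. */
interface FileNamer {
    String rename(String defaultPath);
}

/** Stand-in for a format's writer factory; this is not Spark's actual API. */
interface WriterFactory {
    String newWriter(String path); // returns the path the writer would write to
}

/** Wraps the original factory and rewrites the path before delegating. */
final class RenamingWriterFactory implements WriterFactory {
    private final WriterFactory delegate;
    private final FileNamer namer;

    RenamingWriterFactory(WriterFactory delegate, String namerClassName) {
        this.delegate = delegate;
        try {
            // Load the naming implementation by reflection from its class name.
            this.namer = (FileNamer) Class.forName(namerClassName)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load namer " + namerClassName, e);
        }
    }

    @Override
    public String newWriter(String path) {
        // The original format still does the writing, only the name changes.
        return delegate.newWriter(namer.rename(path));
    }
}

/** Example namer: keeps the directory and replaces the file name with a fixed one. */
final class TsvNamer implements FileNamer {
    public TsvNamer() {}

    @Override
    public String rename(String defaultPath) {
        int slash = defaultPath.lastIndexOf('/');
        return defaultPath.substring(0, slash + 1)
                + "15988891_1748330679_20200507124153.tsv.gz";
    }
}
```

Because the name is decided before the delegate opens the file, no listing or rename pass is needed once the job finishes.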

The question is - is this interesting/useful enough for the community?
Should we open-source it?
Thanks!

P.S. We posted the same question in the Spark channel on the ASF Slack, if you want to discuss it there:
https://the-asf.slack.com/archives/CD5UQDNBA/p1589117451069600



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org



