spark-user mailing list archives

From Alexander Czech <>
Subject How to use HDFS >3.1.1 with spark 2.3.3 to output parquet files to S3?
Date Sun, 14 Jul 2019 22:10:45 GMT
As the subject suggests, I want to output a parquet file to S3. I know this was
rather troublesome in the past because S3 has no atomic rename, so a rename had
to be done as a copy+delete.
This issue has been discussed before; see:

Now HADOOP-13786 <> fixes
this problem in Hadoop 3.1.0 and later. How can I use that with
Spark 2.3.3? I usually orchestrate my cluster on EC2 with flintrock
<>. Do I just set HDFS to 3.1.1 in the flintrock
config and everything "just works"? Or do I also have to set a
committer algorithm like this when I create my Spark context in PySpark:


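Something along these lines, perhaps? (A sketch only, based on my reading of the S3A committer docs — the `PathOutputCommitProtocol` and `BindingParquetOutputCommitter` classes come from Spark's optional `spark-hadoop-cloud` module, which I'm assuming is on the classpath, and the bucket name is made up:)

```python
# Sketch, not tested: assumes Hadoop 3.1+ and Spark's optional
# spark-hadoop-cloud module are on the classpath; without that module
# the commitProtocolClass settings below will fail to load.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-committer-test")
    # Use the S3A "directory" staging committer from HADOOP-13786
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)

# "my-bucket" is a placeholder, not a real bucket
spark.range(10).write.parquet("s3a://my-bucket/test-output/")
```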
thanks for the help!
