spark-user mailing list archives

From Vladimir Prus <>
Subject Is BindingParquetOutputCommitter still used?
Date Wed, 08 Sep 2021 15:21:10 GMT

Per the cloud-integration documentation, when using
S3 storage one is advised to set these options:

> spark.sql.parquet.output.committer.class = org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
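For reference, the full set of committer settings the Spark/Hadoop cloud-integration docs recommend looks roughly like this (a spark-defaults.conf sketch under my reading of those docs; verify the keys against your Hadoop version):

```
spark.hadoop.fs.s3a.committer.name            magic
spark.hadoop.fs.s3a.committer.magic.enabled   true
spark.sql.sources.commitProtocolClass         org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class      org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

The question below is whether the last line actually has any effect.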

However, looking at the code and running a simple test suggests that
BindingParquetOutputCommitter is not used at all. Specifically, I used this:

  import org.apache.log4j.{Level, Logger}
  import org.apache.spark.sql.SparkSession

  // Raise logging so commit-protocol trace messages are visible.
  Logger.getLogger("org.apache.spark.internal.io").setLevel(Level.TRACE)

  val spark = SparkSession.builder().master("local[*]")
    .config("fs.s3a.committer.magic.enabled", "true")
    .config("fs.s3.committer.magic.enabled", "true")
    .config("fs.s3a.committer.name", "magic") // key assumed; the archive dropped it
    .config("fs.s3.committer.name", "magic")  // key assumed; the archive dropped it
    .getOrCreate()
  import spark.implicits._
  val df = Seq("foo", "bar").toDF("s")
  df.write.parquet("s3a://...") // bucket/path elided


I observe that the magic committer is used, and I get trace log messages from
PathOutputCommitProtocol, but none from BindingParquetOutputCommitter.
If I remove the configuration options that set BindingParquetOutputCommitter, I
still see the magic committer used.
The spark.sql.parquet.output.committer.class option is only read in
ParquetFileFormat, where it is copied to spark.sql.sources.outputCommitterClass,
and that option, in turn, is only used by SQLHadoopMapReduceCommitProtocol
- which we don't use here.
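The chain described above can be sketched as a toy model (not Spark source; only the two conf key names are real, the protocol names are used as plain strings and the committer-selection logic is a stand-in for what I believe the real classes do):

```scala
// Toy model of the config hand-off: ParquetFileFormat copies the
// parquet-specific key to the generic sources key, and only
// SQLHadoopMapReduceCommitProtocol ever reads the copied key.
object CommitterResolution {
  val ParquetKey = "spark.sql.parquet.output.committer.class"
  val SourcesKey = "spark.sql.sources.outputCommitterClass"

  // Models ParquetFileFormat.prepareWrite: copy the parquet key across.
  def prepareWrite(conf: Map[String, String]): Map[String, String] =
    conf.get(ParquetKey).fold(conf)(cls => conf + (SourcesKey -> cls))

  // Models committer selection: PathOutputCommitProtocol picks its
  // committer from the filesystem (fs.s3a.committer.name) and never
  // consults the sources key.
  def committerFor(protocol: String, conf: Map[String, String]): Option[String] =
    if (protocol == "SQLHadoopMapReduceCommitProtocol") conf.get(SourcesKey)
    else None
}
```

Under this model, whatever you set spark.sql.parquet.output.committer.class to is invisible once PathOutputCommitProtocol is the commit protocol - which matches what I observe in the test above.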

So, it sounds like setting spark.sql.parquet.output.committer.class to
BindingParquetOutputCommitter is no longer necessary?
Or is there some code path where it matters?

Vladimir Prus
