spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject [DISCUSS] naming policy of Spark configs
Date Wed, 12 Feb 2020 15:42:43 GMT
Hi all,

I'd like to discuss the naming policy for Spark configs. Right now naming
depends on personal preference, which leads to inconsistent names.

In general, the config name should be a noun that describes its meaning
clearly.
Good examples:
spark.sql.session.timeZone
spark.sql.streaming.continuous.executorQueueSize
spark.sql.statistics.histogram.numBins
Bad example:
spark.sql.defaultSizeInBytes (default size for what?)

Also note that a config name has many parts, joined by dots. Each part is a
namespace. Don't create namespaces unnecessarily.
Good examples:
spark.sql.execution.rangeExchange.sampleSizePerPartition
spark.sql.execution.arrow.maxRecordsPerBatch
Bad example:
spark.sql.windowExec.buffer.in.memory.threshold ("in" is not a useful
namespace, better to be .buffer.inMemoryThreshold)
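
To make the renaming above concrete, here is a minimal Python sketch that
folds a run of dot-separated parts into a single camelCase part. The helper
name merge_parts and its index arguments are made up for illustration; this
is not an existing Spark utility.

```python
def merge_parts(name: str, first: int, last: int) -> str:
    """Merge parts first..last (inclusive, 0-based) into one camelCase part.

    Hypothetical helper for illustration only, not part of Spark.
    """
    parts = name.split(".")
    # Keep the first merged part as-is, capitalize the first letter of the rest.
    merged = parts[first] + "".join(
        p[:1].upper() + p[1:] for p in parts[first + 1:last + 1]
    )
    return ".".join(parts[:first] + [merged] + parts[last + 1:])


old_name = "spark.sql.windowExec.buffer.in.memory.threshold"
# Parts 4..6 are "in", "memory", "threshold".
new_name = merge_parts(old_name, 4, 6)
print(new_name)  # spark.sql.windowExec.buffer.inMemoryThreshold
```

The point is only that "in" and "memory" stop being namespaces of their own;
which parts deserve to be merged is still a judgment call.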

For a big feature, usually we need to create an umbrella config to turn it
on/off, and other configs for fine-grained controls. These configs should
share the same namespace, and the umbrella config should be named like
featureName.enabled. For example:
spark.sql.cbo.enabled
spark.sql.cbo.starSchemaDetection
spark.sql.cbo.starJoinFTRatio
spark.sql.cbo.joinReorder.enabled
spark.sql.cbo.joinReorder.dp.threshold (BTW "dp" is not a good namespace)
spark.sql.cbo.joinReorder.card.weight (BTW "card" is not a good namespace)
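
As a rough illustration of the umbrella rule, the following Python sketch
checks whether a feature namespace has a matching .enabled config and lists
the configs that share the namespace. The helper names has_umbrella and
feature_configs are assumptions made up for this example; the config names
are the ones listed above.

```python
def has_umbrella(names: list[str], namespace: str) -> bool:
    """True if the namespace has an umbrella config named <namespace>.enabled."""
    return f"{namespace}.enabled" in names


def feature_configs(names: list[str], namespace: str) -> list[str]:
    """All configs that live under the given namespace."""
    prefix = namespace + "."
    return [n for n in names if n.startswith(prefix)]


# The cbo configs from the list above.
cbo_configs = [
    "spark.sql.cbo.enabled",
    "spark.sql.cbo.starSchemaDetection",
    "spark.sql.cbo.starJoinFTRatio",
    "spark.sql.cbo.joinReorder.enabled",
    "spark.sql.cbo.joinReorder.dp.threshold",
    "spark.sql.cbo.joinReorder.card.weight",
]

print(has_umbrella(cbo_configs, "spark.sql.cbo"))                 # True
print(has_umbrella(cbo_configs, "spark.sql.cbo.joinReorder"))     # True
print(has_umbrella(cbo_configs, "spark.sql.cbo.joinReorder.dp"))  # False
print(len(feature_configs(cbo_configs, "spark.sql.cbo.joinReorder")))  # 3
```

A namespace such as "dp" that holds a single config and has no umbrella is
a hint that it may not need to be a namespace at all.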

For boolean configs, the name should in general end with a verb, e.g.
spark.sql.join.preferSortMergeJoin. If the config is for a feature and you
can't find a good verb for the feature, featureName.enabled is also good.

I'll update https://spark.apache.org/contributing.html after we reach a
consensus here. Any comments are welcome!

Thanks,
Wenchen
