spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jungtaek Lim <kabhwan.opensou...@gmail.com>
Subject Re: [DISCUSS] naming policy of Spark configs
Date Thu, 13 Feb 2020 03:29:04 GMT
+1 Thanks for the proposal. Looks very reasonable to me.

On Thu, Feb 13, 2020 at 10:53 AM Hyukjin Kwon <gurwls223@gmail.com> wrote:

> +1.
>
> 2020년 2월 13일 (목) 오전 9:30, Gengliang Wang <gengliang.wang@databricks.com>님이
> 작성:
>
>> +1, this is really helpful. We should make the SQL configurations
>> consistent and more readable.
>>
>> On Wed, Feb 12, 2020 at 3:33 PM Rubén Berenguel <rberenguel@gmail.com>
>> wrote:
>>
>>> I love it, it will make configs easier to read and write. Thanks
>>> Wenchen.
>>>
>>> R
>>>
>>> On 13 Feb 2020, at 00:15, Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>>>
>>> 
>>> Thank you, Wenchen.
>>>
>>> The new policy looks clear to me. +1 for the explicit policy.
>>>
>>> So, are we going to revise the existing conf names before 3.0.0 release?
>>>
>>> Or, is it applied to new up-coming configurations from now?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Wed, Feb 12, 2020 at 7:43 AM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to discuss the naming policy of Spark configs, as for now it
>>>> depends on personal preference which leads to inconsistent namings.
>>>>
>>>> In general, the config name should be a noun that describes its meaning
>>>> clearly.
>>>> Good examples:
>>>> spark.sql.session.timeZone
>>>> spark.sql.streaming.continuous.executorQueueSize
>>>> spark.sql.statistics.histogram.numBins
>>>> Bad examples:
>>>> spark.sql.defaultSizeInBytes (default size for what?)
>>>>
>>>> Also note that, config name has many parts, joined by dots. Each part
>>>> is a namespace. Don't create namespace unnecessarily.
>>>> Good example:
>>>> spark.sql.execution.rangeExchange.sampleSizePerPartition
>>>> spark.sql.execution.arrow.maxRecordsPerBatch
>>>> Bad examples:
>>>> spark.sql.windowExec.buffer.in.memory.threshold ("in" is not a useful
>>>> namespace, better to be .buffer.inMemoryThreshold)
>>>>
>>>> For a big feature, usually we need to create an umbrella config to turn
>>>> it on/off, and other configs for fine-grained controls. These configs
>>>> should share the same namespace, and the umbrella config should be named
>>>> like featureName.enabled. For example:
>>>> spark.sql.cbo.enabled
>>>> spark.sql.cbo.starSchemaDetection
>>>> spark.sql.cbo.starJoinFTRatio
>>>> spark.sql.cbo.joinReorder.enabled
>>>> spark.sql.cbo.joinReorder.dp.threshold (BTW "dp" is not a good
>>>> namespace)
>>>> spark.sql.cbo.joinReorder.card.weight (BTW "card" is not a good
>>>> namespace)
>>>>
>>>> For boolean configs, in general it should end with a verb, e.g.
>>>> spark.sql.join.preferSortMergeJoin. If the config is for a feature and
>>>> you can't find a good verb for the feature, featureName.enabled is
>>>> also good.
>>>>
>>>> I'll update https://spark.apache.org/contributing.html after we reach
>>>> a consensus here. Any comments are welcome!
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>>
>>>>

Mime
View raw message