spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: Request to document the direct relationship between other configurations
Date Wed, 12 Feb 2020 13:18:22 GMT
Yeah, that's one of my point why I dont want to document this in the guide
yet.

I would like to make sure dev people are on the same page that documenting
is a better practice. I dont intend to force as a hard requirement;
however, if that's pointed out, it should better to address.


On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <cloud0fan@gmail.com> wrote:

> In general I think it's better to have more detailed documents, but we
> don't have to force everyone to do it if the config name is structured. I
> would +1 to document the relationship of we can't tell it from the config
> names, e.g. spark.shuffle.service.enabled
> and spark.dynamicAllocation.enabled.
>
> On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>
>> Also, I would like to hear other people' thoughts on here. Could I ask
>> what you guys think about this in general?
>>
>> 2020년 2월 12일 (수) 오후 12:02, Hyukjin Kwon <gurwls223@gmail.com>님이
작성:
>>
>>> To do that, we should explicitly document such structured configuration
>>> and implicit effect, which is currently missing.
>>> I would be more than happy if we document such implied relationship,
>>> *and* if we are very sure all configurations are structured correctly
>>> coherently.
>>> Until that point, I think it might be more practical to simply document
>>> it for now.
>>>
>>> > Btw, maybe off-topic, `spark.dynamicAllocation` is having another
>>> issue on practice - whether to duplicate description between configuration
>>> code and doc. I have been asked to add description on configuration code
>>> regardlessly, and existing codebase doesn't. This configuration is
>>> widely-used one.
>>> This is actually something we should fix too. in SQL configuration, now
>>> we don't have such duplications as of
>>> https://github.com/apache/spark/pull/27459 as it generates. We should
>>> do it in other configurations.
>>>
>>>
>>> 2020년 2월 12일 (수) 오전 11:47, Jungtaek Lim <kabhwan.opensource@gmail.com>님이
>>> 작성:
>>>
>>>> I'm looking into the case of `spark.dynamicAllocation` and this seems
>>>> to be the thing to support my voice.
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>>>>
>>>> I don't disagree with adding "This requires
>>>> spark.shuffle.service.enabled to be set." in the description of
>>>> `spark.dynamicAllocation.enabled`. This cannot be inferred implicitly,
>>>> hence it should be better to have it.
>>>>
>>>> Why I'm in favor of structured configuration & implicit effect over
>>>> describing everything explicitly is there.
>>>>
>>>> 1. There're 10 configurations (if the doc doesn't miss any other
>>>> configuration) except `spark.dynamicAllocation.enabled`, and only 4
>>>> configurations are referred in the description of
>>>> `spark.dynamicAllocation.enabled` - majority of config keys are missing.
>>>> 2. I think it's intentional, but the table starts
>>>> with `spark.dynamicAllocation.enabled` which talks implicitly but
>>>> intuitively that if you disable this then everything on dynamic allocation
>>>> won't work. Missing majority of references on config keys don't get it hard
>>>> to understand.
>>>> 3. Even `spark.dynamicAllocation` has bad case - see
>>>> `spark.dynamicAllocation.shuffleTracking.enabled` and
>>>> `spark.dynamicAllocation.shuffleTimeout`. It is not respecting the
>>>> structure of configuration. I think this is worse than not explicitly
>>>> mentioning the description. Let's assume the name has
>>>> been `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive
>>>> that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false`
>>>> would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?
>>>>
>>>> Btw, maybe off-topic, `spark.dynamicAllocation` is having another issue
>>>> on practice - whether to duplicate description between configuration code
>>>> and doc. I have been asked to add description on configuration code
>>>> regardlessly, and existing codebase doesn't. This configuration is
>>>> widely-used one.
>>>>
>>>>
>>>> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls223@gmail.com>
>>>> wrote:
>>>>
>>>>> Sure, adding "[DISCUSS]" is a good practice to label it. I had to do
>>>>> it although it might be "redundant" :-) since anyone can give feedback
to
>>>>> any thread in Spark dev mailing list, and discuss.
>>>>>
>>>>> This is actually more prevailing given my rough reading of
>>>>> configuration files. I would like to see this missing relationship as
a bad
>>>>> pattern, started from a personal preference.
>>>>>
>>>>> > Personally I'd rather not think someone won't understand setting
>>>>> `.enabled` to `false` means the functionality is disabled and effectively
>>>>> it disables all sub-configurations.
>>>>> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the
>>>>> configurations for `spark.sql.adaptive.*` are implicitly no-op. For me
this
>>>>> is pretty intuitive and the one of major
>>>>> > benefits of the structured configurations.
>>>>>
>>>>> I don't think this is a good idea we assume for users to know such
>>>>> contexts. One might think
>>>>> `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
>>>>> partially enable the feature. It is better to be explicit to document
>>>>> since some of configurations are even difficult for users to confirm
if it
>>>>> is working or not.
>>>>> For instance, one might think setting
>>>>> 'spark.eventLog.rolling.maxFileSize' automatically enables rolling. Then,
>>>>> they realise the log is not rolling later after the file
>>>>> size becomes bigger.
>>>>>
>>>>>
>>>>> 2020년 2월 12일 (수) 오전 10:47, Jungtaek Lim <kabhwan.opensource@gmail.com>님이
>>>>> 작성:
>>>>>
>>>>>> I'm sorry if I miss something, but this is ideally better to be
>>>>>> started as [DISCUSS] as I haven't seen any reference to have consensus
on
>>>>>> this practice.
>>>>>>
>>>>>> For me it's just there're two different practices co-existing on
the
>>>>>> codebase, meaning it's closer to the preference of individual (with
>>>>>> implicitly agreeing that others have different preferences), or it
hasn't
>>>>>> been discussed thoughtfully.
>>>>>>
>>>>>> Personally I'd rather not think someone won't understand setting
>>>>>> `.enabled` to `false` means the functionality is disabled and effectively
>>>>>> it disables all sub-configurations. E.g. when `spark.sql.adaptive.enabled`
>>>>>> is `false`, all the configurations for `spark.sql.adaptive.*` are
>>>>>> implicitly no-op. For me this is pretty intuitive and the one of
major
>>>>>> benefits of the structured configurations.
>>>>>>
>>>>>> If we want to make it explicit, "all" sub-configurations should have
>>>>>> redundant part of the doc. More redundant if the condition is nested.
I
>>>>>> agree this is the good step of "be kind" but less pragmatic.
>>>>>>
>>>>>> I'd be happy to follow the consensus we would make in this thread.
>>>>>> Appreciate more voices.
>>>>>>
>>>>>> Thanks,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <gurwls223@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> > I don't plan to document this officially yet
>>>>>>> Just to prevent confusion, I meant I don't yet plan to document
the
>>>>>>> fact that we should write the relationships in configurations
as a
>>>>>>> code/review guideline in https://spark.apache.org/contributing.html
>>>>>>>
>>>>>>>
>>>>>>> 2020년 2월 12일 (수) 오전 9:57, Hyukjin Kwon <gurwls223@gmail.com>님이
작성:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I happened to review some PRs and I noticed that some
>>>>>>>> configurations don't have some information
>>>>>>>> necessary.
>>>>>>>>
>>>>>>>> To be explicit, I would like to make sure we document the
direct
>>>>>>>> relationship between other configurations
>>>>>>>> in the documentation. For example,
>>>>>>>> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
>>>>>>>> can be only enabled when `spark.sql.adaptive.enabled` is
enabled.
>>>>>>>> That's clearly documented.
>>>>>>>> We're good in general given that we document them in general
in
>>>>>>>> Apache Spark.
>>>>>>>> See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled',
>>>>>>>> 'spark.sql.parquet.filterPushdown', etc.
>>>>>>>>
>>>>>>>> However, I noticed such a pattern that such information is
missing
>>>>>>>> in some components in general, for example,
>>>>>>>> `spark.history.fs.cleaner.*`, `spark.history.kerberos.*`
and
>>>>>>>> `spark.history.ui.acls.* `
>>>>>>>>
>>>>>>>> I hope we all start to document such information. Logically
users
>>>>>>>> can't know the relationship and I myself
>>>>>>>> had to read the codes to confirm when I review.
>>>>>>>> I don't plan to document this officially yet because to me
it looks
>>>>>>>> a pretty logical request to me; however,
>>>>>>>> let me know if you guys have some different opinions.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>

Mime
View raw message