spark-dev mailing list archives

From Jules Damji <dmat...@comcast.net>
Subject Re: Request to document the direct relationship between other configurations
Date Wed, 12 Feb 2020 16:08:14 GMT
All are valid and valuable observations to put into practice: 

* structured and meaningful config names 
* explainable text or succinct description
* easily accessible or searchable 

While aspirational, these are gradually doable if we make them part of the dev and review
cycle. Often meaningful config names, like those for security, come as an afterthought. 

At the AMA at Spark Summit Amsterdam last year, a few developers lamented how hard it is to
find Spark configs: what they do, what they are used for, when and why, and what their default
values are. 

Though one can fetch them programmatically, one still has to know which specific config one
is looking for. 

Cheers 
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

>> On Feb 12, 2020, at 5:19 AM, Hyukjin Kwon <gurwls223@gmail.com> wrote:
> 
> Yeah, that's one of my points for why I don't want to document this in the guide yet.
> 
> I would like to make sure dev people are on the same page that documenting this is the better
practice. I don't intend to force it as a hard requirement; however, if it's pointed out in
review, it would be better to address it.
> 
> 
>> On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <cloud0fan@gmail.com> wrote:
>> In general I think it's better to have more detailed documents, but we don't have
to force everyone to do it if the config name is structured. I would +1 to documenting the relationship
if we can't tell it from the config names, e.g. spark.shuffle.service.enabled and spark.dynamicAllocation.enabled.
>> 
>>> On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>>> Also, I would like to hear other people's thoughts here. Could I ask what you
guys think about this in general?
>>> 
>>> On Wed, Feb 12, 2020 at 12:02 PM, Hyukjin Kwon <gurwls223@gmail.com> wrote:
>>>> To do that, we should explicitly document such structured configurations and their
implicit effects, which is currently missing.
>>>> I would be more than happy if we documented such implied relationships, once
we are very sure all configurations are structured coherently.
>>>> Until that point, I think it is more practical to simply document the relationships
explicitly for now.
>>>> 
>>>> > Btw, maybe off-topic, `spark.dynamicAllocation` has another practical
issue: whether to duplicate descriptions between configuration code and docs. I have
been asked to add descriptions in configuration code regardless, while the existing codebase doesn't.
This is a widely-used configuration.
>>>> This is actually something we should fix too. In SQL configurations, we no
longer have such duplication as of https://github.com/apache/spark/pull/27459, which generates the documentation.
We should do the same for other configurations.
>>>> 
>>>> 
>>>>> On Wed, Feb 12, 2020 at 11:47 AM, Jungtaek Lim <kabhwan.opensource@gmail.com> wrote:
>>>>> I'm looking into the case of `spark.dynamicAllocation`, and it seems
to support my point.
>>>>> 
>>>>> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>>>>> 
>>>>> I don't disagree with adding "This requires spark.shuffle.service.enabled
to be set." to the description of `spark.dynamicAllocation.enabled`. This cannot be inferred
implicitly, hence it is better to have it.
>>>>> 
>>>>> Here is why I'm in favor of structured configuration and implicit effect over
describing everything explicitly. 
>>>>> 
>>>>> 1. There are 10 configurations (if the doc doesn't miss any)
besides `spark.dynamicAllocation.enabled`, and only 4 of them are referred to in the description
of `spark.dynamicAllocation.enabled`; the majority of config keys are missing.
>>>>> 2. I think it's intentional, but the table starts with `spark.dynamicAllocation.enabled`,
which conveys implicitly but intuitively that if you disable it, everything about dynamic
allocation stops working. Missing the majority of config key references doesn't make it hard
to understand.
>>>>> 3. Even `spark.dynamicAllocation` has a bad case: see `spark.dynamicAllocation.shuffleTracking.enabled`
and `spark.dynamicAllocation.shuffleTimeout`. The latter does not respect the structure of the configuration.
I think this is worse than not explicitly mentioning the relationship in the description. Suppose the name
had been `spark.dynamicAllocation.shuffleTracking.timeout`: isn't it intuitive that setting
`spark.dynamicAllocation.shuffleTracking.enabled` to `false` would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?
>>>>> 
>>>>> Btw, maybe off-topic, `spark.dynamicAllocation` has another practical
issue: whether to duplicate descriptions between configuration code and docs. I have
been asked to add descriptions in configuration code regardless, while the existing codebase doesn't.
This is a widely-used configuration.
>>>>> 
>>>>> 
>>>>>> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls223@gmail.com>
wrote:
>>>>>> Sure, adding "[DISCUSS]" is a good practice to label it. I had to
do it although it might be "redundant" :-) since anyone can give feedback and discuss in any
thread on the Spark dev mailing list.
>>>>>> 
>>>>>> This pattern is actually quite prevalent, given my rough reading of the configuration
files. I would like to see this missing relationship documentation treated as a bad pattern,
even if that started from a personal preference.
>>>>>> 
>>>>>> > Personally I'd rather not think someone won't understand that setting
`.enabled` to `false` means the functionality is disabled and effectively disables all
sub-configurations.
>>>>>> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations
under `spark.sql.adaptive.*` are implicitly no-ops. For me this is pretty intuitive and one
of the major
>>>>>> > benefits of structured configurations.
>>>>>> 
>>>>>> I don't think it is a good idea to assume users know such
context. One might think `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
>>>>>> partially enable the feature. It is better to document it explicitly,
since some configurations are difficult for users to even confirm whether they are working.
>>>>>> For instance, one might think setting 'spark.eventLog.rolling.maxFileSize'
automatically enables rolling, and only realise later that the log is not rolling, once the
>>>>>> file size grows.
>>>>>> 
>>>>>> 
>>>>>> On Wed, Feb 12, 2020 at 10:47 AM, Jungtaek Lim <kabhwan.opensource@gmail.com> wrote:
>>>>>>> I'm sorry if I missed something, but this would ideally
be started as a [DISCUSS] thread, as I haven't seen any reference to a consensus on this practice.
>>>>>>> 
>>>>>>> For me, there are just two different practices co-existing
in the codebase, meaning it's closer to an individual preference (with implicit agreement
that others have different preferences), or it hasn't been discussed thoughtfully.
>>>>>>> 
>>>>>>> Personally I'd rather not think someone won't understand that setting
`.enabled` to `false` means the functionality is disabled and effectively disables all
sub-configurations. E.g. when `spark.sql.adaptive.enabled` is `false`, all the configurations
under `spark.sql.adaptive.*` are implicitly no-ops. For me this is pretty intuitive and one
of the major benefits of structured configurations.
>>>>>>> 
>>>>>>> If we want to make it explicit, "all" sub-configurations would
need the redundant part of the doc, and more redundancy if the condition is nested. I agree
this is a good step toward "being kind", but it is less pragmatic.
>>>>>>> 
>>>>>>> I'd be happy to follow the consensus we would make in this thread.
Appreciate more voices.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>> 
>>>>>>> 
>>>>>>>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <gurwls223@gmail.com>
wrote:
>>>>>>>> > I don't plan to document this officially yet
>>>>>>>> Just to prevent confusion, I meant I don't yet plan to document,
as a code/review guideline in https://spark.apache.org/contributing.html, that we should write
down the relationships between configurations.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Feb 12, 2020 at 9:57 AM, Hyukjin Kwon <gurwls223@gmail.com> wrote:
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> I happened to review some PRs and noticed that some
configurations are missing some necessary information.
>>>>>>>>> 
>>>>>>>>> To be explicit, I would like to make sure we document
the direct relationships between configurations
>>>>>>>>> in the documentation. For example, `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
>>>>>>>>> can only be enabled when `spark.sql.adaptive.enabled`
is enabled. That's clearly documented.
>>>>>>>>> We're generally good here, given that we usually
document these in Apache Spark.
>>>>>>>>> See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled',
'spark.sql.parquet.filterPushdown', etc. 
>>>>>>>>> 
>>>>>>>>> However, I noticed that such information
is missing in some components, for example,
>>>>>>>>> `spark.history.fs.cleaner.*`, `spark.history.kerberos.*`
and `spark.history.ui.acls.*`
>>>>>>>>> 
>>>>>>>>> I hope we all start to document such information. Logically
users can't know these relationships, and I myself
>>>>>>>>> had to read the code to confirm them when reviewing. 
>>>>>>>>> I don't plan to document this officially yet because
it looks like a pretty reasonable request to me; however,
>>>>>>>>> let me know if you have different opinions.
>>>>>>>>> 
>>>>>>>>> Thanks.
