spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: Request to document the direct relationship between other configurations
Date Thu, 13 Feb 2020 03:44:24 GMT
I think it’s just fine as long as we’re consistent with the instances
having the description, for instance:

  When true and ‘spark.xx.xx’ is enabled, …

I think this is 2-2 in most cases so far. I think we can reference other
configuration keys in another configuration documentation by using
ADAPTIVE_EXECUTION_ENABLED.key as an example. So we don’t have duplication
problem in most cases.

Being consistent with the current base is a general guidance if I am not
mistaken. We have identified a problem and a good goal to reach.
If we want to change, let's do it as our final goal. Otherwise, let's
simplify it to reduce the overhead rather then having a policy for the
mid-term specifically.


2020년 2월 13일 (목) 오후 12:24, Jungtaek Lim <kabhwan.opensource@gmail.com>님이
작성:

> I tend to agree that there should be a time to make thing be consistent
> (and I'm very happy to see the new thread on discussion) and we may want to
> take some practice in the interim.
>
> But for me it is not clear what is the practice in the interim. I pointed
> out the problems of existing style and if we all agree the problems are
> valid then we may need to fix it before start using it widely.
>
> For me the major question is "where" to put - at least there're two
> different approaches:
>
> 1) put it to the description of `.enable` and describe the range of impact
> (e.g.) put the description of "spark.A.enable" saying it will affect the
> following configurations under "spark.A".
> (I don't think it's good to enumerate all of affected configs, as
> `spark.dynamicAllocation.enable` is telling it is fragile.)
>
> 2) put it to the description of all affected configurations and describe
> which configuration is the prerequisite to let this be effective. e.g. put
> it on all configurations under `spark.dynamicAllocation` mentioning
> `spark.dynamicAllocation.enable` should be enabled to make this be
> effective.
>
> (I intended to skip mentioning "cross reference". Let's be pragmatic.)
>
> 2) has also two ways to describe:
>
> 2-1) Just mention simply - like "When dynamic allocation is enabled,", not
> pointing out the key to toggle. This hugely simplify the description,
> though end users may have to do the second guess to find the key to toggle.
> (It'll be intuitive when we structurize the configurations.)
>
> 2-2) Mention the key to toggle directly - like "This is effective only if
> spark.A.enable is set to true.". It's going to be longer depending on how
> long the configuration key is. Personally I'd feel verbose unless the key
> to toggle is not placed to the spot we can infer, but others may have
> different opinions.
>
> I may be missing some details, so please participate to add the details.
> Otherwise we may want to choose the best one, and have a sample form of
> description. (Once we reach here, it may be OK to add to the contribution
> doc, as that is the thing we agree about.)
>
> Without the details it's going to be a some sort of "preference" which is
> natural to have disagreement, hence it doesn't make sense someone is forced
> to do something if something turns out to be "preference".
>
> On Thu, Feb 13, 2020 at 11:10 AM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>
>> Adding those information is already a more prevailing style at this
>> moment, and this is usual to follow prevailing side if there isn't a
>> specific reason.
>> If there is confusion about this, I will explicitly add it into the guide
>> (https://spark.apache.org/contributing.html). Let me know if this still
>> confuses or disagree.
>>
>> 2020년 2월 13일 (목) 오전 9:47, Hyukjin Kwon <gurwls223@gmail.com>님이
작성:
>>
>>> Yes, that's probably our final goal to revisit the configurations to
>>> make it structured and deduplicated documentation cleanly. +1.
>>>
>>> One point I would like to add is though to add such information to the
>>> documentation until we actually manage our final goal
>>> since probably it's going to take a while to revisit/fix and make it
>>> fully structured with the documentation.
>>> If we go more conservatively, we can add such information to the new
>>> configurations being added from now on at least, and keeping the existing
>>> configurations as are.
>>>
>>>
>>> On Thu, 13 Feb 2020, 06:12 Dongjoon Hyun, <dongjoon.hyun@gmail.com>
>>> wrote:
>>>
>>>> Thank you for raising the issue, Hyukjin.
>>>>
>>>> According to the current status of discussion, it seems that we are
>>>> able to agree on updating the non-structured configurations and keeping the
>>>> structured configuration AS-IS.
>>>>
>>>> I'm +1 for the revisiting the configurations if that is our direction.
>>>> If there is some mismatch in structured configurations, we may fix them
>>>> together.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Wed, Feb 12, 2020 at 8:08 AM Jules Damji <dmatrix@comcast.net>
>>>> wrote:
>>>>
>>>>> All are valid and valuable observations to put into practice:
>>>>>
>>>>> * structured and meaningful config names
>>>>> * explainable text or succinct description
>>>>> * easily accessible or searchable
>>>>>
>>>>> While these are aspirational but gradually doable if we make it part
>>>>> of the dev and review cycle. Often meaningful config names, like security,
>>>>> come as after thought.
>>>>>
>>>>> At the AMA in Amsterdam Spark Summit last year, a few developers
>>>>> lamented the lack of finding Spark configs—what they do; what they
are used
>>>>> for; when and why; and what are their default values.
>>>>>
>>>>> Though you one fetch them programmatically, one still has to know what
>>>>> specific config one islooking for.
>>>>>
>>>>> Cheers
>>>>> Jules
>>>>>
>>>>> Sent from my iPhone
>>>>> Pardon the dumb thumb typos :)
>>>>>
>>>>> On Feb 12, 2020, at 5:19 AM, Hyukjin Kwon <gurwls223@gmail.com>
wrote:
>>>>>
>>>>> 
>>>>> Yeah, that's one of my point why I dont want to document this in the
>>>>> guide yet.
>>>>>
>>>>> I would like to make sure dev people are on the same page that
>>>>> documenting is a better practice. I dont intend to force as a hard
>>>>> requirement; however, if that's pointed out, it should better to address.
>>>>>
>>>>>
>>>>> On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <cloud0fan@gmail.com> wrote:
>>>>>
>>>>>> In general I think it's better to have more detailed documents, but
>>>>>> we don't have to force everyone to do it if the config name is structured.
>>>>>> I would +1 to document the relationship of we can't tell it from
the config
>>>>>> names, e.g. spark.shuffle.service.enabled
>>>>>> and spark.dynamicAllocation.enabled.
>>>>>>
>>>>>> On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <gurwls223@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Also, I would like to hear other people' thoughts on here. Could
I
>>>>>>> ask what you guys think about this in general?
>>>>>>>
>>>>>>> 2020년 2월 12일 (수) 오후 12:02, Hyukjin Kwon <gurwls223@gmail.com>님이
작성:
>>>>>>>
>>>>>>>> To do that, we should explicitly document such structured
>>>>>>>> configuration and implicit effect, which is currently missing.
>>>>>>>> I would be more than happy if we document such implied
>>>>>>>> relationship, *and* if we are very sure all configurations
are
>>>>>>>> structured correctly coherently.
>>>>>>>> Until that point, I think it might be more practical to simply
>>>>>>>> document it for now.
>>>>>>>>
>>>>>>>> > Btw, maybe off-topic, `spark.dynamicAllocation` is having
another
>>>>>>>> issue on practice - whether to duplicate description between
configuration
>>>>>>>> code and doc. I have been asked to add description on configuration
code
>>>>>>>> regardlessly, and existing codebase doesn't. This configuration
is
>>>>>>>> widely-used one.
>>>>>>>> This is actually something we should fix too. in SQL configuration,
>>>>>>>> now we don't have such duplications as of
>>>>>>>> https://github.com/apache/spark/pull/27459 as it generates.
We
>>>>>>>> should do it in other configurations.
>>>>>>>>
>>>>>>>>
>>>>>>>> 2020년 2월 12일 (수) 오전 11:47, Jungtaek Lim <
>>>>>>>> kabhwan.opensource@gmail.com>님이 작성:
>>>>>>>>
>>>>>>>>> I'm looking into the case of `spark.dynamicAllocation`
and this
>>>>>>>>> seems to be the thing to support my voice.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>>>>>>>>>
>>>>>>>>> I don't disagree with adding "This requires
>>>>>>>>> spark.shuffle.service.enabled to be set." in the description
of
>>>>>>>>> `spark.dynamicAllocation.enabled`. This cannot be inferred
implicitly,
>>>>>>>>> hence it should be better to have it.
>>>>>>>>>
>>>>>>>>> Why I'm in favor of structured configuration & implicit
effect
>>>>>>>>> over describing everything explicitly is there.
>>>>>>>>>
>>>>>>>>> 1. There're 10 configurations (if the doc doesn't miss
any other
>>>>>>>>> configuration) except `spark.dynamicAllocation.enabled`,
and only 4
>>>>>>>>> configurations are referred in the description of
>>>>>>>>> `spark.dynamicAllocation.enabled` - majority of config
keys are missing.
>>>>>>>>> 2. I think it's intentional, but the table starts
>>>>>>>>> with `spark.dynamicAllocation.enabled` which talks implicitly
but
>>>>>>>>> intuitively that if you disable this then everything
on dynamic allocation
>>>>>>>>> won't work. Missing majority of references on config
keys don't get it hard
>>>>>>>>> to understand.
>>>>>>>>> 3. Even `spark.dynamicAllocation` has bad case - see
>>>>>>>>> `spark.dynamicAllocation.shuffleTracking.enabled` and
>>>>>>>>> `spark.dynamicAllocation.shuffleTimeout`. It is not respecting
the
>>>>>>>>> structure of configuration. I think this is worse than
not explicitly
>>>>>>>>> mentioning the description. Let's assume the name has
>>>>>>>>> been `spark.dynamicAllocation.shuffleTracking.timeout`
- isn't it intuitive
>>>>>>>>> that setting `spark.dynamicAllocation.shuffleTracking.enabled`
to `false`
>>>>>>>>> would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?
>>>>>>>>>
>>>>>>>>> Btw, maybe off-topic, `spark.dynamicAllocation` is having
another
>>>>>>>>> issue on practice - whether to duplicate description
between configuration
>>>>>>>>> code and doc. I have been asked to add description on
configuration code
>>>>>>>>> regardlessly, and existing codebase doesn't. This configuration
is
>>>>>>>>> widely-used one.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls223@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Sure, adding "[DISCUSS]" is a good practice to label
it. I had to
>>>>>>>>>> do it although it might be "redundant" :-) since
anyone can give feedback
>>>>>>>>>> to any thread in Spark dev mailing list, and discuss.
>>>>>>>>>>
>>>>>>>>>> This is actually more prevailing given my rough reading
of
>>>>>>>>>> configuration files. I would like to see this missing
relationship as a bad
>>>>>>>>>> pattern, started from a personal preference.
>>>>>>>>>>
>>>>>>>>>> > Personally I'd rather not think someone won't
understand
>>>>>>>>>> setting `.enabled` to `false` means the functionality
is disabled and
>>>>>>>>>> effectively it disables all sub-configurations.
>>>>>>>>>> > E.g. when `spark.sql.adaptive.enabled` is `false`,
all the
>>>>>>>>>> configurations for `spark.sql.adaptive.*` are implicitly
no-op. For me this
>>>>>>>>>> is pretty intuitive and the one of major
>>>>>>>>>> > benefits of the structured configurations.
>>>>>>>>>>
>>>>>>>>>> I don't think this is a good idea we assume for users
to know
>>>>>>>>>> such contexts. One might think
>>>>>>>>>> `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled`
can
>>>>>>>>>> partially enable the feature. It is better to be
explicit to
>>>>>>>>>> document since some of configurations are even difficult
for users to
>>>>>>>>>> confirm if it is working or not.
>>>>>>>>>> For instance, one might think setting
>>>>>>>>>> 'spark.eventLog.rolling.maxFileSize' automatically
enables rolling. Then,
>>>>>>>>>> they realise the log is not rolling later after the
file
>>>>>>>>>> size becomes bigger.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2020년 2월 12일 (수) 오전 10:47, Jungtaek Lim
<
>>>>>>>>>> kabhwan.opensource@gmail.com>님이 작성:
>>>>>>>>>>
>>>>>>>>>>> I'm sorry if I miss something, but this is ideally
better to be
>>>>>>>>>>> started as [DISCUSS] as I haven't seen any reference
to have consensus on
>>>>>>>>>>> this practice.
>>>>>>>>>>>
>>>>>>>>>>> For me it's just there're two different practices
co-existing on
>>>>>>>>>>> the codebase, meaning it's closer to the preference
of individual (with
>>>>>>>>>>> implicitly agreeing that others have different
preferences), or it hasn't
>>>>>>>>>>> been discussed thoughtfully.
>>>>>>>>>>>
>>>>>>>>>>> Personally I'd rather not think someone won't
understand setting
>>>>>>>>>>> `.enabled` to `false` means the functionality
is disabled and effectively
>>>>>>>>>>> it disables all sub-configurations. E.g. when
`spark.sql.adaptive.enabled`
>>>>>>>>>>> is `false`, all the configurations for `spark.sql.adaptive.*`
are
>>>>>>>>>>> implicitly no-op. For me this is pretty intuitive
and the one of major
>>>>>>>>>>> benefits of the structured configurations.
>>>>>>>>>>>
>>>>>>>>>>> If we want to make it explicit, "all" sub-configurations
should
>>>>>>>>>>> have redundant part of the doc. More redundant
if the condition is nested.
>>>>>>>>>>> I agree this is the good step of "be kind" but
less pragmatic.
>>>>>>>>>>>
>>>>>>>>>>> I'd be happy to follow the consensus we would
make in this
>>>>>>>>>>> thread. Appreciate more voices.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon
<
>>>>>>>>>>> gurwls223@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > I don't plan to document this officially
yet
>>>>>>>>>>>> Just to prevent confusion, I meant I don't
yet plan to document
>>>>>>>>>>>> the fact that we should write the relationships
in configurations as a
>>>>>>>>>>>> code/review guideline in
>>>>>>>>>>>> https://spark.apache.org/contributing.html
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2020년 2월 12일 (수) 오전 9:57, Hyukjin
Kwon <gurwls223@gmail.com>님이
>>>>>>>>>>>> 작성:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I happened to review some PRs and I noticed
that some
>>>>>>>>>>>>> configurations don't have some information
>>>>>>>>>>>>> necessary.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To be explicit, I would like to make
sure we document the
>>>>>>>>>>>>> direct relationship between other configurations
>>>>>>>>>>>>> in the documentation. For example,
>>>>>>>>>>>>> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
>>>>>>>>>>>>> can be only enabled when `spark.sql.adaptive.enabled`
is
>>>>>>>>>>>>> enabled. That's clearly documented.
>>>>>>>>>>>>> We're good in general given that we document
them in general
>>>>>>>>>>>>> in Apache Spark.
>>>>>>>>>>>>> See 'spark.task.reaper.enabled',
>>>>>>>>>>>>> 'spark.dynamicAllocation.enabled', 'spark.sql.parquet.filterPushdown',
etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, I noticed such a pattern that
such information is
>>>>>>>>>>>>> missing in some components in general,
for example,
>>>>>>>>>>>>> `spark.history.fs.cleaner.*`, `spark.history.kerberos.*`
and
>>>>>>>>>>>>> `spark.history.ui.acls.* `
>>>>>>>>>>>>>
>>>>>>>>>>>>> I hope we all start to document such
information. Logically
>>>>>>>>>>>>> users can't know the relationship and
I myself
>>>>>>>>>>>>> had to read the codes to confirm when
I review.
>>>>>>>>>>>>> I don't plan to document this officially
yet because to me it
>>>>>>>>>>>>> looks a pretty logical request to me;
however,
>>>>>>>>>>>>> let me know if you guys have some different
opinions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>

Mime
View raw message