spark-dev mailing list archives

From Dongjoon Hyun <>
Subject Re: Spark SQL upgrade / migration guide: discoverability and content organization
Date Mon, 15 Jul 2019 05:07:59 GMT
Thank you, Josh and Xiao. That sounds great.

Do you think we can have some parts of that improvement in the `2.4.4`
documentation first, since that is the very next release?


On Sun, Jul 14, 2019 at 4:25 PM Xiao Li <> wrote:

> Yeah, Josh! All these ideas sound good to me. All the top commercial
> database products have very detailed guides/documents on version
> upgrades. You can easily find them.
> Currently, only the SQL and ML modules have migration or upgrade guides.
> Since the Spark 2.3 release, we have strictly required PR authors to
> document all behavior changes in the SQL component. I would suggest doing
> the same in the other modules, for example Spark Core and Structured
> Streaming. Any objections?
> Cheers,
> Xiao
> On Sun, Jul 14, 2019 at 2:05 PM Josh Rosen <> wrote:
>> I'd like to discuss the Spark SQL migration / upgrade guides in the Spark
>> documentation: these are valuable resources and I think we could increase
>> that value by making these docs easier to discover and by adding a bit more
>> structure to the existing content.
>> For folks who aren't familiar with these docs: the Spark docs have a "SQL
>> Migration Guide" which lists the deprecations and changes of behavior in
>> each release:
>>    - Latest published version:
>>    - Master branch version (will become 3.0):
>> A lot of community work went into crafting this doc and I really
>> appreciate those efforts.
>> This doc is a little hard to find, though, because it's not consistently
>> linked from release notes pages: the 2.4.0 page links it under "Changes of
>> Behavior", but subsequent maintenance releases do not link to it. It's also
>> not very cross-linked from the rest of the Spark docs (e.g. the Overview
>> doc, doc drop-down menus, etc).
>> I'm also concerned that the doc may be overwhelming to end users (as
>> opposed to Spark developers):
>>    - *Entries aren't grouped by component*, so users need to read the
>>    entire document to spot changes relevant to their use of Spark (for
>>    example, PySpark changes are not grouped together).
>>    - *Entries aren't ordered by size / risk of change,* e.g. performance
>>    impact vs. loud behavior change (stopping with an explicit exception) vs.
>>    silent behavior changes (e.g. changing default rounding behavior). If we
>>    assume limited reader attention then it may be important to prioritize the
>>    order in which we list entries, putting the highest-expected-impact /
>>    lowest-organic-discoverability changes first.
>>    - *We don't link JIRAs*, forcing users to do their own archaeology to
>>    learn more about a specific change.
>> The existing ML migration guide addresses some of these issues, so maybe
>> we can emulate it in the SQL guide:
>> I think that documentation clarity is especially important with Spark 3.0
>> around the corner: many folks will seek out this information when they
>> upgrade, so improving this guide can be a high-leverage, high-impact
>> activity.
>> What do folks think? Does anyone have examples from other projects which
>> do a notably good job of crafting release notes / migration guides? I'd be
>> glad to help with pre-release editing after we decide on a structure and
>> style.
>> Cheers,
>> Josh
