Thanks for taking the time to comment.

We discussed these issues during the Incubator proposal stage, as others brought them up as well. I can't remember all the details, but let me go through your points inline.

> My reservation here is that as an Apache project, it might appear to
> 'bless' one set of materials as authoritative over all the others out

I understand why it might be seen that way, and we need to make it clear that we have no intention of becoming "The official Apache Spark training".
> And there are already lots of good ones. For example, Jacek has
> long maintained a very comprehensive set of free Spark training
> materials at
> In comparison the slides I see proposed so far only seem like

Jacek is indeed doing a fantastic job (and I'm sure others as well).

In this case, however, a company decided to donate their internal material - they didn't create this from scratch for the Apache Training project.
We want to encourage contributions, and the fact that someone else has already created material on the same topic shouldn't stop us from accepting this one.

The opposite in fact: There's very little collaboration - in general - around training material.
Every company creates its own material as an asset to sell. There's very little quality open-source material out there.
I'm not sure how many companies have created Spark training courses. I wouldn't be surprised if it goes into the hundreds. And everyone draws the same or very similar slides (what's an RDD, what's a DataFrame etc.)
We hope to change that, and this contribution can be a start.

We did some research around training and especially open-source training before we started the initiative and there are some projects out there that do this but all we found were silos with a relatively narrow focus and no greater community.

Regarding your "outlines" comment: No, this is the "final" material (pending review of course). By "Training" we mean the kind of training that Cloudera, Databricks et al. sell as well, where an instructor-led course is given using slides. These slides can, but don't have to, speak for themselves. We're fine with the requirement that an experienced instructor needs to give this training. But that applies only to this particular contribution. We're also happy to accept other forms of content that are meant to be consumed differently (self-serve). We don't intend to write exhaustive or authoritative documentation for projects.

It just frees people from having to do the tedious work of creating (and updating) hundreds of slides.

> It's also a separate project from Spark. We might have trouble
> ensuring the info is maintained and up to date, and sometimes outdated
> or incorrect info is worse than none - especially if it appears quasi
> official. The Spark project already maintains and updates its docs
> (which can always be better), so already has its hands full there.

Definitely. Outdated information is always a danger and I can't guarantee that it won't happen here.
The fact that this is hosted and governed by the ASF makes it less likely to be completely abandoned, though, as there are clear processes in place for collaboration that don't depend on a single person (which might be the case with some of the other things that already exist).
We also hope that communities like Spark are interested in collaborating. Patches are always welcome, but so is creating a Jira to point out outdated information.
> Personally, no strong objection here, but, what's the upside to
> running this as an ASF project vs just letting people continue to
> publish quality tutorials online?

Some points come to mind. This list is not exhaustive, and not every point applies equally to all the material that others have published:

- Clear and easy guidelines for collaboration
- Not a "bus factor" of one
- Everything is open-source with a friendly license and customizable
- We're still just getting started, but because we already have four or five different contributions, we can share one technology stack between all of them. That makes it easier to collaborate ("everything looks familiar"), and every piece of content benefits from improvements in the technical stack
- We hope to have non-tool focused sessions later as well (e.g. Ingesting data from Kafka into Elasticsearch using Spark [okay, this would maybe be a bit too specific for now but something along the lines of a "Data Ingestion" training]) where we can mix and match from the content we have

I'd have to dig into the original discuss threads in the Incubator to find more, but I hope this helps a bit?


On Fri, Jul 26, 2019 at 9:00 AM Lars Francke <> wrote:
> Hi Spark community,
> you may or may not have heard of a new-ish (February 2019) project at Apache: Apache Training (incubating). We aim to develop training material about various projects inside and outside the ASF: <>
> One of our users wants to contribute material on Spark[1]
> We've done something similar for ZooKeeper[2] in the past and the ZooKeeper community provided excellent feedback which helped make the product much better[3].
> That's why I'd like to invite everyone here to provide any kind of feedback on the content donation. It is currently in PowerPoint format which makes it a bit harder to review so we're happy to accept feedback in any form.
> The idea is to convert the material to AsciiDoc at some point.
> Cheers,
> Lars
> (I didn't want to cross post to user@ as well but this is obviously not limited to dev@ users)
> [1] <>
> [2] <>
> [3] You can see the content here <>