spark-dev mailing list archives

From Lars Francke <>
Subject Re: Apache Training contribution for Spark - Feedback welcome
Date Fri, 26 Jul 2019 21:00:36 GMT

Thanks for taking the time to comment.

We've discussed those issues during the proposal stage for the Incubator as
others brought them up as well. I can't remember all the details but let me
go through your points inline.

> My reservation here is that as an Apache project, it might appear to
> 'bless' one set of materials as authoritative over all the others out
> there.

I understand why it might be seen that way, and we need to make it clear
that we have no intention of becoming "The official Apache Spark
training".

> And there are already lots of good ones. For example, Jacek has
> long maintained a very comprehensive set of free Spark training
> materials at
> In comparison the slides I see proposed so far only seem like
> outlines?

Jacek is indeed doing a fantastic job (and I'm sure others as well).

In this case, however, a company decided to donate their internal material
- they didn't create this from scratch for the Apache Training project.
We want to encourage contributions, and the fact that someone else has
already created material shouldn't stop us from accepting this.

The opposite in fact: There's very little collaboration - in general -
around training material.
Every company creates its own material as an asset to sell. There's very
little quality open-source material out there.
I'm not sure how many companies have created Spark training courses. I
wouldn't be surprised if it goes into the hundreds. And everyone draws the
same or very similar slides (what's an RDD, what's a DataFrame etc.)
We hope to change that and this contribution can be a first step.

We did some research around training and especially open-source training
before we started the initiative and there are some projects out there that
do this but all we found were silos with a relatively narrow focus and no
greater community.

Regarding your "outlines" comment: No, this is the "final" material
(pending review of course). By "Training" we mean training in the sense
that Cloudera, Databricks et al. sell as well, where an instructor-led
course is given using slides. These slides can, but don't have to,
speak for themselves. We're fine with the requirement that an experienced
instructor needs to give this training. But that applies only to this content.
We're also happy to accept other forms of content that are meant for a
different way of consumption (self-serve). We don't intend to write
exhaustive or authoritative documentation for projects.

It just frees people from having to do the tedious work of creating (and
updating) hundreds of slides.

> It's also a separate project from Spark. We might have trouble
> ensuring the info is maintained and up to date, and sometimes outdated
> or incorrect info is worse than none - especially if it appears quasi
> official. The Spark project already maintains and updates its docs
> (which can always be better), so already has its hands full there.

Definitely. Outdated information is always a danger and I can't guarantee
that this isn't going to happen here.
The fact that this is hosted and governed by the ASF makes it less likely
to be completely abandoned though as there are clear processes in place for
collaboration that don't depend on a single person (which might be the case
with some of the other things that already exist).
We also hope that communities - like Spark - are interested in
collaborating. Patches are always welcome, but so is creating a Jira to
point out outdated information.

> Personally, no strong objection here, but, what's the upside to
> running this as an ASF project vs just letting people continue to
> publish quality tutorials online?

Some points come to mind. This list is not exhaustive, nor do all points
apply equally to all the material that others have published:

- Clear and easy guidelines for collaboration
- Not a "bus factor" of one
- Everything is open-source with a friendly license and customizable
- We're still just getting started but because we already have four or five
different contributions we can share one technology stack between all of
them making it easier to collaborate ("everything looks familiar") and
every piece of content benefits from improvements in the technical stack
- We hope to have non-tool-focused sessions later as well (e.g. Ingesting
data from Kafka into Elasticsearch using Spark [okay, this would maybe be a
bit too specific for now but something along the lines of a "Data
Ingestion" training]) where we can mix and match from the content we have

I'd have to dig into the original discuss threads in the incubator to find
more but I hope this helps a bit?


> On Fri, Jul 26, 2019 at 9:00 AM Lars Francke <>
> wrote:
> >
> > Hi Spark community,
> >
> > you may or may not have heard of a new-ish (February 2019) project at
> Apache: Apache Training (incubating). We aim to develop training material
> about various projects inside and outside the ASF: <
> >
> > One of our users wants to contribute material on Spark[1]
> >
> > We've done something similar for ZooKeeper[2] in the past and the
> ZooKeeper community provided excellent feedback which helped make the
> product much better[3].
> >
> > That's why I'd like to invite everyone here to provide any kind of
> feedback on the content donation. It is currently in PowerPoint format
> which makes it a bit harder to review so we're happy to accept feedback in
> any form.
> >
> > The idea is to convert the material to AsciiDoc at some point.
> >
> > Cheers,
> > Lars
> >
> > (I didn't want to cross post to user@ as well but this is obviously not
> limited to dev@ users)
> >
> > [1] <>
> > [2] <>
> > [3] You can see the content here <
> >
