spark-dev mailing list archives

From Debasish Das
Subject Re: Spark Improvement Proposals
Date Mon, 17 Oct 2016 02:21:10 GMT
Thanks, Cody, for bringing up a valid point. I picked up Spark in 2014 as
soon as I looked into it: compared to writing Java MapReduce and Cascading
code, Spark made writing distributed code fun. But now that we have gone
deeper with Spark and real-time streaming use cases are becoming more
prominent, I think it is time to bring in a messaging model alongside the
batch/micro-batch API that Spark is good at. Close integration of
akka-streams with Spark's micro-batching APIs looks like a great direction
for staying in the game with Apache Flink. Spark 2.0 integrated streaming
with batch on the assumption that micro-batching is sufficient to run SQL
commands on a stream, but do we really have time to do SQL processing on
streaming data within 1-2 seconds?

After reading the email chain, I started looking into the Flink
documentation. Comparing it with the Spark documentation, I think we have
major work to do documenting Spark internals, so that more people from the
community take an active role in fixing issues and Spark stays strong
compared to Flink.

Spark is no longer just an engine for micro-batch and batch. We (and I am
sure many others) are pushing Spark as an engine for stream and query
processing. We need to make it a state-of-the-art engine for high-speed
streaming data and user queries as well!

On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda wrote:

> Hi everyone,
>
> I'm quite late with my answer, but I think my suggestions may help a
> little bit. :) Many technical and organizational topics were mentioned,
> but I want to focus on the negative posts about Spark and about "haters".
>
> I really like Spark. Ease of use, speed, a very good community: it has
> everything. But every project has to fight on the "framework market"
> to stay no. 1. I'm following many Spark and Big Data communities;
> maybe my mail will inspire someone. :)
>
> You (every Spark developer; so far I haven't had enough time to start
> contributing to Spark) have done an excellent job. So why are some people
> saying that Flink (or another framework) is better, as was posted on
> this mailing list? No, not because that framework is better in all
> cases. In my opinion, many of these discussions were started after
> Flink marketing-like posts. Please look at StackOverflow "Flink vs ...."
> posts: almost every post is "won" by Flink. Answers sometimes say
> nothing about other frameworks; Flink's users (often PMC members) are
> just posting the same information about real-time streaming, about delta
> iterations, etc. It looks smart and very often it is marked as an answer,
> even if, in my opinion, the whole truth wasn't told.
>
> My suggestion: I don't have enough money and knowledge to perform a huge
> performance test. Maybe some company that supports Spark (Databricks,
> Cloudera? - just saying you're the most visible in the community :) )
> could perform a performance test of:
> - streaming engine - Spark will probably lose because of the micro-batch
> model, however the difference should now be much smaller than in
> previous versions
> - Machine Learning models
> - batch jobs
> - Graph jobs
> - SQL queries
>
> People will see that Spark is evolving and is also a modern framework,
> because after reading the posts mentioned above people may think "it is
> outdated, the future is in framework X".
>
> Matei Zaharia posted an excellent blog post about how Spark Structured
> Streaming beats every other framework in terms of ease of use and
> reliability. Performance tests, done in various environments (for
> example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node
> cluster), could also be very good marketing material to say "hey, you're
> saying that you're better, but Spark is still faster and is still
> getting even faster!". This would be based on facts (just numbers),
> not opinions. It would be good for companies, for marketing purposes, and
> for every Spark developer.
>
> Second: real-time streaming. Some time ago I wrote about real-time
> streaming support in Spark Structured Streaming. Some work should be
> done to make SSS more low-latency, but I think it's possible. Maybe
> Spark could look at Gearpump, which is also built on top of Akka? I don't
> know yet; it is a good topic for a SIP. However, I think that Spark should
> have real-time streaming support. Currently I see many posts/comments
> saying that "Spark has too much latency". Spark Streaming does a very
> good job with micro-batches, but I think it is also possible to add more
> real-time processing.
>
> Other people have said much more, and I agree with the SIP proposal. I'm
> also happy that PMC members are not saying that they will not listen to
> users, but that they really want to make Spark better for every user.
>
> What do you think about these two topics? I'm looking especially at Cody
> (who started this topic) and the PMC members :)
>
> Pozdrawiam / Best regards,
> Tomasz
>
> On 2016-10-07 at 04:51, Cody Koeninger wrote:
> > I love Spark.  3 or 4 years ago it was the first distributed computing
> > environment that felt usable, and the community was welcoming.
> >
> > But I just got back from the Reactive Summit, and this is what I
> > observed:
> >
> > - Industry leaders on stage making fun of Spark's streaming model
> > - Open source project leaders saying they looked at Spark's governance
> > as a model to avoid
> > - Users saying they chose Flink because it was technically superior
> > and they couldn't get any answers on the Spark mailing lists
> >
> > Whether you agree with the substance of any of this, when this stuff
> > gets repeated enough people will believe it.
> >
> > Right now Spark is suffering from its own success, and I think
> > something needs to change.
> >
> > - We need a clear process for planning significant changes to the
> > codebase.
> > I'm not saying you need to adopt Kafka Improvement Proposals exactly,
> > but you need a documented process with a clear outcome (e.g. a vote).
> > Passing around google docs after an implementation has largely been
> > decided on doesn't cut it.
> >
> > - All technical communication needs to be public.
> > Things getting decided in private chat, or when 1/3 of the committers
> > work for the same company and can just talk to each other...
> > Yes, it's convenient, but it's ultimately detrimental to the health of
> > the project.
> > The way structured streaming has played out has shown that there are
> > significant technical blind spots (myself included).
> > One way to address that is to get the people who have domain knowledge
> > involved, and listen to them.
> >
> > - We need more committers, and more committer diversity.
> > Per committer there are, what, more than 20 contributors and 10 new
> > jira tickets a month?  It's too much.
> > There are people (I am _not_ referring to myself) who have been around
> > for years, contributed thousands of lines of code, helped educate the
> > public around Spark... and yet are never going to be voted in.
> >
> > - We need a clear process for managing volunteer work.
> > Too many tickets sit around unowned, unclosed, uncertain.
> > If someone proposed something and it isn't up to snuff, tell them and
> > close it.  It may be blunt, but it's clearer than "silent no".
> > If someone wants to work on something, let them own the ticket and set
> > a deadline. If they don't meet it, close it or reassign it.
> >
> > This is not me putting on an Apache Bureaucracy hat.  This is me
> > saying, as a fellow hacker and loyal dissenter, something is wrong
> > with the culture and process.
> >
> > Please, let's change it.
