spark-user mailing list archives

From Michael Malak <>
Subject Re: Apache Flink
Date Sun, 17 Apr 2016 21:59:54 GMT
As with all history, "what if"s are not scientifically testable hypotheses, but my speculation
is that it comes down to the energy (VCs, startups, big Internet companies, universities) within
Silicon Valley compared with Germany.

      From: Mich Talebzadeh <>
 To: Michael Malak <>; "user @spark" <>

 Sent: Sunday, April 17, 2016 3:55 PM
 Subject: Re: Apache Flink
Assuming that both Spark and Flink are contemporaries, what are the reasons that Flink has
not been adopted widely? (This may sound obvious and/or prejudged.) I mean, Spark has surged
in popularity in the past year, if I am correct.
Dr Mich Talebzadeh LinkedIn 
On 17 April 2016 at 22:49, Michael Malak <> wrote:

In terms of publication date, a paper on Nephele was published in 2009, prior to the 2010
USENIX paper on Spark. Nephele is the execution engine of Stratosphere, which became Flink.

      From: Mark Hamstra <>
 To: Mich Talebzadeh <> 
Cc: Corey Nolet <>; "user @spark" <>
 Sent: Sunday, April 17, 2016 3:30 PM
 Subject: Re: Apache Flink
To be fair, the Stratosphere project from which Flink springs was started as a collaborative
university research project in Germany about the same time that Spark was first released as
Open Source, so they are near contemporaries rather than Flink having been started only well
after Spark was an established and widely-used Apache project.
On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh <> wrote:

Also, it always amazes me why there are so many tangential projects in the Big Data space. Would
it not be easier if efforts were spent on adding to Spark functionality rather than creating
a new product like Flink?
Dr Mich Talebzadeh LinkedIn 
On 17 April 2016 at 21:08, Mich Talebzadeh <> wrote:

Thanks Corey for the useful info.
I have used Sybase Aleri and StreamBase as commercial CEP engines. However, there does not
seem to be anything close to these products in the Hadoop ecosystem. So I guess there is nothing

Dr Mich Talebzadeh LinkedIn 
On 17 April 2016 at 20:43, Corey Nolet <> wrote:

I have not been intrigued at all by the microbatching concept in Spark. I am used to CEP in
real stream-processing environments like InfoSphere Streams & Storm, where the granularity
of processing is at the level of each individual tuple and processing units (workers) can
react immediately to events being received and processed. The closest Spark Streaming comes
to this concept is the notion of "state" that can be updated via the "updateStateByKey()"
function, which can only be run in a microbatch. Looking at the expected design changes
to Spark Streaming in Spark 2.0.0, it also does not look like tuple-at-a-time processing is
on the radar for Spark, though I have seen articles stating that more effort is going to go
into the Spark SQL layer in Spark streaming which may make it more reminiscent of Esper.
For these reasons, I have not even tried to implement CEP in Spark. I feel it's a waste of
time without immediate tuple-at-a-time processing. Without this, they avoid the whole problem
of "back pressure" (though keep in mind, it is still very possible to overload the Spark streaming
layer with stages that will continue to pile up and never get worked off) but they lose the
granular control that you get in CEP environments by allowing the rules & processors to
react with the receipt of each tuple, right away. 
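
To make the micro-batch state point concrete, here is a minimal pure-Python sketch of the semantics described above. It is not Spark code: `apply_microbatch` and `running_count` are hypothetical names, and the sensor events are made up for illustration. The only assumption taken from Spark is the general shape of an "updateStateByKey()"-style update function (new values for a key plus its previous state in, new state out), with state changing once per batch rather than once per tuple.

```python
from collections import defaultdict


def apply_microbatch(state, batch, update_func):
    """Apply one micro-batch of (key, value) pairs to the keyed state.

    Hypothetical helper, not a Spark API. The new state only becomes
    visible after the whole batch has been processed.
    """
    grouped = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)

    new_state = {}
    # Visit keys that already have state as well as keys seen in this batch.
    for key in set(state) | set(grouped):
        updated = update_func(grouped.get(key, []), state.get(key))
        if updated is not None:  # returning None drops the key
            new_state[key] = updated
    return new_state


def running_count(new_values, previous):
    """Update function: add this batch's event count to the previous count."""
    return (previous or 0) + len(new_values)


# Two illustrative micro-batches of (key, value) events.
batches = [
    [("sensor-a", 1), ("sensor-b", 1), ("sensor-a", 1)],
    [("sensor-a", 1)],
]

state = {}
snapshots = []
for batch in batches:
    state = apply_microbatch(state, batch, running_count)
    snapshots.append(dict(state))
```

Note that the two "sensor-a" events in the first batch are seen together by the update function; no rule or processor gets to react between them, which is the granularity a tuple-at-a-time engine would offer.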
A while back, I did attempt to implement an InfoSphere Streams-like API [1] on top of Apache
Storm as an example of what such a design may look like. It looks like Storm is going to be
replaced in the not so distant future by Twitter's new design called Heron. IIRC, Heron does
not have an open source implementation as of yet. 
On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <> wrote:

Hi Corey,
Can you please point me to docs on using Spark for CEP? Do we have a set of CEP libraries
somewhere? I am keen on getting hold of adaptor libraries for Spark, something like below


Dr Mich Talebzadeh LinkedIn 
On 17 April 2016 at 16:07, Corey Nolet <> wrote:

One thing I've noticed about Flink in my following of the project has been that it has established,
in a few cases, some novel ideas and improvements over Spark. The problem with it, however,
is that both the development team and the community around it are very small and many of those
novel improvements have been rolled directly into Spark in subsequent versions. I was considering
changing over my architecture to Flink at one point to get better, more real-time CEP streaming
support, but in the end I decided to stick with Spark and just watch Flink continue to pressure
it into improvement.
On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers <> wrote:

i never found much info that flink was actually designed to be fault tolerant. if fault tolerance
is more bolt-on/add-on/afterthought then that doesn't bode well for large scale data processing.
spark was designed with fault tolerance in mind from the beginning.

On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh <> wrote:

I read the benchmark published by Yahoo. Obviously they already use Storm and are inevitably very
familiar with that tool. To start with, although these benchmarks were somewhat interesting,
IMO they lend themselves to an assurance that the tool chosen for their platform is still the best
choice. So inevitably the benchmarks and the tests were done primarily to support their approach.
In general, anything which is not done through the TPC (Transaction Processing Performance
Council) or a similar body is questionable. Their argument is that because Spark handles data
streaming in micro-batches, it inevitably introduces built-in latency by design. In contrast,
both Storm and Flink do not (at face value) have this issue.
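
As a rough illustration of the latency argument, here is a small Python sketch under stated assumptions: a 1000 ms batch interval, zero processing time, and an event arriving exactly on a batch boundary being picked up by the next batch. The function name and the arrival times are hypothetical, and this is a toy model, not a measurement of Spark, Storm, or Flink.

```python
def microbatch_latencies(arrival_times_ms, batch_interval_ms):
    """Wait between each event's arrival and the end of its batch window.

    Toy model: an event is only processed when its batch closes, and
    processing itself is assumed to take no time.
    """
    latencies = []
    for arrival in arrival_times_ms:
        # Integer division finds the window the event falls into.
        batch_end = (arrival // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(batch_end - arrival)
    return latencies


# Illustrative arrival times in milliseconds.
arrivals = [0, 250, 999, 1000, 1500]

micro = microbatch_latencies(arrivals, batch_interval_ms=1000)

# Under the same zero-processing-time assumption, a tuple-at-a-time
# engine handles each event on arrival, so the added wait is zero.
tuple_at_a_time = [0 for _ in arrivals]

for arrival, wait in zip(arrivals, micro):
    print(f"event at {arrival} ms waits {wait} ms for its batch to close")
```

In this model the added wait is bounded by the batch interval, so whether it matters depends on the latency the use case can tolerate.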
In addition, as we already know, Spark has far more capabilities than Flink (I know nothing
about Storm). So really the choice of tool boils down to the business SLA for your use
case. IMO Spark's micro-batching approach is probably OK for 99% of use
cases. If we had built-in CEP libraries for Spark (I am searching for them), I would not
bother with Flink.

Dr Mich Talebzadeh LinkedIn 
On 17 April 2016 at 12:47, Ovidiu-Cristian MARCU <> wrote:

You probably read this benchmark at Yahoo, any comments from Spark?

On 17 Apr 2016, at 12:41, andy petrella <> wrote:
Just adding one thing to the mix: `that the latency for streaming data is eliminated` is insane
On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh <> wrote:

 It seems that Flink argues that the latency for streaming data is eliminated whereas with
Spark RDD there is this latency.
I noticed that Flink does not support an interactive shell like the Spark shell, where you can
add jars to it to do Kafka testing. The advice was to add the streaming Kafka jar file to
CLASSPATH, but that does not work.
Most Flink documentation is also rather sparse, with the usual example of word count, which is
not exactly what you want.
Anyway I will have a look at it further. I have a Spark Scala streaming Kafka program that
works fine in Spark and I want to recode it using Scala for Flink with Kafka but have difficulty
importing and testing libraries.
Dr Mich Talebzadeh LinkedIn 
On 17 April 2016 at 02:41, Ascot Moss <> wrote:

I compared both last month, seems to me that Flink's MLLib is not yet ready.
On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <> wrote:

Thanks Ted. I was wondering if someone is using both :)
Dr Mich Talebzadeh LinkedIn 
On 16 April 2016 at 17:08, Ted Yu <> wrote:

Looks like this question is more relevant on flink mailing list :-)
On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh <> wrote:

Has anyone used Apache Flink instead of Spark, by any chance?
I am interested in its set of libraries for Complex Event Processing.
Frankly I don't know if it offers far more than Spark offers.
Dr Mich Talebzadeh LinkedIn 


