spark-user mailing list archives

From Otis Gospodnetić <otis.gospodne...@gmail.com>
Subject Re: Apache Flink
Date Mon, 18 Apr 2016 00:09:02 GMT
While Flink may not be younger than Spark, Spark came to Apache first,
which always helps.  Plus, there was already a lot of buzz around Spark
before it came to Apache.  Coming from Berkeley also helps.

That said, Flink seems decently healthy to me:
- http://search-hadoop.com/?fc_project=Flink&fc_type=mail+_hash_+user&q=
- http://search-hadoop.com/?fc_project=Flink&fc_type=mail+_hash_+dev&q=
- http://search-hadoop.com/?fc_project=Flink&fc_type=issue&q=&startDate=1445472000000&endDate=1461024000000

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Sun, Apr 17, 2016 at 5:55 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Assuming that Spark and Flink are contemporaries, what are the reasons
> that Flink has not been adopted as widely? (this may sound obvious and/or
> prejudged). I mean, Spark has surged in popularity in the past year, if I am
> correct.
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 22:49, Michael Malak <michaelmalak@yahoo.com> wrote:
>
>> In terms of publication date, a paper on Nephele was published in 2009,
>> prior to the 2010 USENIX paper on Spark. Nephele is the execution engine of
>> Stratosphere, which became Flink.
>>
>>
>> ------------------------------
>> *From:* Mark Hamstra <mark@clearstorydata.com>
>> *To:* Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> *Cc:* Corey Nolet <cjnolet@gmail.com>; "user @spark" <
>> user@spark.apache.org>
>> *Sent:* Sunday, April 17, 2016 3:30 PM
>> *Subject:* Re: Apache Flink
>>
>> To be fair, the Stratosphere project from which Flink springs was started
>> as a collaborative university research project in Germany about the same
>> time that Spark was first released as Open Source, so they are near
>> contemporaries rather than Flink having been started only well after Spark
>> was an established and widely-used Apache project.
>>
>> On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>> Also, it always amazes me why there are so many tangential projects in the Big
>> Data space. Would it not be easier if efforts were spent on adding to Spark's
>> functionality rather than creating a new product like Flink?
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 17 April 2016 at 21:08, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> wrote:
>>
>> Thanks Corey for the useful info.
>>
>> I have used Sybase Aleri and StreamBase as commercial CEP engines.
>> However, there does not seem to be anything close to these products in the
>> Hadoop ecosystem. So I guess there is nothing there?
>>
>> Regards.
>>
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 17 April 2016 at 20:43, Corey Nolet <cjnolet@gmail.com> wrote:
>>
>> I have not been intrigued at all by the micro-batching concept in Spark. I
>> am used to CEP in real stream-processing environments like InfoSphere
>> Streams & Storm, where the granularity of processing is at the level of each
>> individual tuple, and processing units (workers) can react immediately to
>> events as they are received and processed. The closest Spark Streaming comes to
>> this concept is the notion of "state" that can be updated via the
>> updateStateByKey() functions, which can only be run once per
>> micro-batch. Looking at the expected design changes to Spark Streaming in
>> Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
>> the radar for Spark, though I have seen articles stating that more effort
>> is going to go into the Spark SQL layer in Spark Streaming, which may make
>> it more reminiscent of Esper.
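For readers unfamiliar with the contrast being drawn here, a minimal sketch (plain Python, no Spark; both function names and the sample events are invented for illustration) of batch-boundary state updates in the style of updateStateByKey versus reacting to each tuple as it arrives:

```python
from collections import defaultdict

def update_state_by_key_style(batches):
    """Micro-batch style: state becomes visible only at batch
    boundaries, after every event in the batch has been folded in."""
    state = defaultdict(int)
    snapshots = []
    for batch in batches:
        for key, value in batch:
            state[key] += value
        snapshots.append(dict(state))  # one snapshot per batch, not per event
    return snapshots

def tuple_at_a_time_style(events, on_update):
    """CEP style: the callback fires immediately for every tuple."""
    state = defaultdict(int)
    for key, value in events:
        state[key] += value
        on_update(key, state[key])  # react right away, per event
    return dict(state)

# The same six events, grouped into two micro-batches vs streamed one by one.
batches = [[("a", 1), ("b", 2), ("a", 3)], [("b", 1), ("a", 1), ("b", 1)]]
events = [e for b in batches for e in b]

snapshots = update_state_by_key_style(batches)   # 2 observable updates
reactions = []
tuple_at_a_time_style(events, lambda k, v: reactions.append((k, v)))
print(snapshots)
print(len(reactions))  # one reaction per tuple
```

Same final state either way; the difference is how often downstream logic gets a chance to react.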
>>
>> For these reasons, I have not even tried to implement CEP in Spark. I
>> feel it is a waste of time without true tuple-at-a-time processing.
>> By leaving that out, Spark avoids the whole problem of "back pressure" (though keep
>> in mind it is still very possible to overload the Spark Streaming layer
>> with stages that continue to pile up and never get worked off), but it
>> loses the granular control that you get in CEP environments, where
>> rules & processors can react to the receipt of each tuple, right away.
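That granular control can be sketched in a few lines (plain Python, no Storm or InfoSphere Streams; the failure-alert rule and its thresholds are an invented example): a rule evaluated on the receipt of each tuple fires the instant its condition is met, rather than at the next batch boundary.

```python
from collections import defaultdict, deque

def cep_rule_engine(events, threshold=3, window=10):
    """Tiny tuple-at-a-time rule: alert the moment a host logs
    `threshold` failures within `window` seconds -- the alert is
    emitted while handling the very tuple that completes the pattern."""
    recent = defaultdict(deque)   # host -> timestamps of recent failures
    alerts = []
    for ts, host, status in events:
        if status != "FAIL":
            continue
        q = recent[host]
        q.append(ts)
        while q and ts - q[0] > window:  # evict events outside the window
            q.popleft()
        if len(q) >= threshold:
            alerts.append((ts, host))    # fires on *this* tuple's arrival
    return alerts

events = [(1, "web1", "FAIL"), (2, "web1", "OK"),  (3, "web1", "FAIL"),
          (5, "web2", "FAIL"), (6, "web1", "FAIL"), (20, "web1", "FAIL")]
print(cep_rule_engine(events))
```

In a micro-batched system the same rule could only fire when the batch containing the third failure is processed, however long the batch interval is.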
>>
>> A while back, I did attempt to implement an InfoSphere Streams-like API
>> [1] on top of Apache Storm as an example of what such a design might look
>> like. It looks like Storm is going to be replaced in the not-so-distant
>> future by Twitter's new design, called Heron. IIRC, Heron does not have an
>> open-source implementation as of yet.
>>
>> [1] https://github.com/calrissian/flowmix
>>
>> On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>> Hi Corey,
>>
>> Can you please point me to docs on using Spark for CEP? Do we have a set
>> of CEP libraries somewhere? I am keen on getting hold of adaptor libraries
>> for Spark, something like the below:
>>
>>
>>
>> [inline image not preserved in the archive]
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 17 April 2016 at 16:07, Corey Nolet <cjnolet@gmail.com> wrote:
>>
>> One thing I've noticed about Flink in my following of the project has
>> been that it has established, in a few cases, some novel ideas and
>> improvements over Spark. The problem with it, however, is that both the
>> development team and the community around it are very small and many of
>> those novel improvements have been rolled directly into Spark in subsequent
>> versions. I was considering changing over my architecture to Flink at one
>> point to get better, more real-time CEP streaming support, but in the end I
>> decided to stick with Spark and just watch Flink continue to pressure it
>> into improvement.
>>
>> On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers <koert@tresata.com>
>> wrote:
>>
>> I never found much info that Flink was actually designed to be fault
>> tolerant. If fault tolerance is more of a bolt-on/add-on/afterthought, then that
>> does not bode well for large-scale data processing. Spark was designed with
>> fault tolerance in mind from the beginning.
>>
>> On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>> Hi,
>>
>> I read the benchmark published by Yahoo. Obviously they already use Storm
>> and are inevitably very familiar with that tool. To start with, although these
>> benchmarks were somewhat interesting, IMO they lend themselves to an assurance
>> that the tool chosen for their platform is still the best choice. So
>> inevitably the benchmarks and the tests were done primarily to support their
>> approach.
>>
>> In general, anything which is not done through the TPC Council or a similar body
>> is questionable.
>> Their argument is that because Spark handles streaming data in micro-
>> batches, it inevitably introduces this built-in latency by design.
>> In contrast, both Storm and Flink do not (at face value) have this
>> issue.
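To put a rough number on that built-in latency (an illustrative back-of-the-envelope calculation, not a measurement of Spark itself): with a batch interval of B seconds, an event is handed to processing only when its batch closes, so it waits somewhere between 0 and B seconds before any work even starts.

```python
import math

def microbatch_wait(arrival, interval):
    """With batches covering [k*interval, (k+1)*interval), an event
    arriving at `arrival` is only handed to processing when its batch
    closes, i.e. at (k+1)*interval. Returns the scheduling wait."""
    k = math.floor(arrival / interval)
    return (k + 1) * interval - arrival

# With a 2-second batch interval:
print(microbatch_wait(0.1, 2.0))  # arrives early in the batch -> waits ~1.9 s
print(microbatch_wait(1.9, 2.0))  # arrives late in the batch  -> waits ~0.1 s
# For uniformly arriving events the average scheduling wait is interval / 2,
# and that is before any actual processing time is added on top.
```

A tuple-at-a-time engine has no such floor; its latency is bounded by per-event processing cost rather than a configured interval.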
>>
>> In addition, as we already know, Spark has far more capabilities compared
>> to Flink (I know nothing about Storm). So really it boils down to the
>> business SLA when choosing which tool to deploy for your use case.
>> IMO Spark's micro-batching approach is probably OK for 99% of use cases. If
>> we had built-in CEP libraries for Spark (I am searching for them), I
>> would not bother with Flink.
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 17 April 2016 at 12:47, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.marcu@inria.fr> wrote:
>>
>> You probably read this benchmark at Yahoo, any comments from Spark?
>>
>> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>> <https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at?soc_src=mail&soc_trk=ma>
>>
>>
>> On 17 Apr 2016, at 12:41, andy petrella <andy.petrella@gmail.com> wrote:
>>
>> Just adding one thing to the mix: `that the latency for streaming data is
>> eliminated` is insane :-D
>>
>> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>> It seems that Flink argues that the latency for streaming data is
>> eliminated, whereas with Spark RDDs there is this latency.
>>
>> I noticed that Flink does not support an interactive shell much like the Spark
>> shell, where you can add jars to it to do Kafka testing. The advice was to
>> add the streaming Kafka jar file to the CLASSPATH, but that does not work.
>>
>> Most Flink documentation is also rather sparse, with the usual example of
>> word count, which is not exactly what you want.
>>
>> Anyway, I will have a look at it further. I have a Spark Scala streaming
>> Kafka program that works fine in Spark, and I want to recode it in Scala
>> for Flink with Kafka, but I have difficulty importing and testing libraries.
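For the Spark side of that comparison, the usual way to make the Kafka integration visible to the interactive shell is spark-shell's --packages or --jars flag rather than editing CLASSPATH; something along these lines (the artifact versions shown are illustrative and must match your Spark/Scala build):

```shell
# Resolve the Kafka integration from Maven and ship it with the shell;
# coordinates below are examples for the Spark 1.6 / Scala 2.10 line.
spark-shell --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.1

# Or point --jars at a local copy of the assembly jar:
spark-shell --jars /path/to/spark-streaming-kafka-assembly_2.10-1.6.1.jar
```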
>>
>> Cheers
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 17 April 2016 at 02:41, Ascot Moss <ascot.moss@gmail.com> wrote:
>>
>> I compared both last month; it seems to me that Flink's ML library is not yet
>> ready.
>>
>> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>> Thanks Ted. I was wondering if someone is using both :)
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 16 April 2016 at 17:08, Ted Yu <yuzhihong@gmail.com> wrote:
>>
>> Looks like this question is more relevant on flink mailing list :-)
>>
>> On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>> Hi,
>>
>> Has anyone used Apache Flink instead of Spark, by any chance?
>>
>> I am interested in its set of libraries for Complex Event Processing.
>>
>> Frankly I don't know if it offers far more than Spark offers.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>>
>>
>> --
>> andy
>>