flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephan Ewen <se...@apache.org>
Subject Re: Discussion: two different computation approaches for flink and spark(Operator-centric, intermediate data-centric)
Date Mon, 10 Aug 2015 13:01:49 GMT

The terms are not yet really standard terms, we only use them to explain
the differences in the runtime design of the two systems. Maybe they will
sink into the data processing vocabulary...

First and most importantly, the two are not mutually exclusive. Both have
to deal with computation and with intermediate data/results. To a certain
extend, both Flink and Spark implement both, but with different assumption
and prominence of the concepts:


This model describes computation as operations/transformations on data
streams, which can come from data sources, or previous operations, or
cached data.

Proper data streaming architectures are more natural in this model, because
long lived operations ans "continuous computation" (as true streaming
assumes) are in its heart.

For stateful operations, this model is a good match as well, in my opinion,
as it does not make the assumption of "side effect"-free transformations of
the data sets.

Concerning iterations: Iterations work well in this model. They do not work
as rolled-out loops (as in Spark), but as "closed loops". So far, all
performance experiments I have seen on iterative computations in both
approaches have favored this model over the "intermediate results".

For streaming iterations with asynchronous feedback, this the operator
model works well as well, as it again provides continuous feedback, rather
than batched.

Intermediate result centric:
The main advantages I see for this model are a clear handle to intermediate
results to re-use them across jobs.
That means (a) in interactive sessions (multiple queries against same
intermediate data set) and (b) in different programs that share
intermediate results.

This model is a low-hanging fruit for iterative jobs, but it boils down to
rolled-out loops, which may not be as efficient as closed loops.


As said in the beginning, the two are not exclusive. By now, Flink actually
has most mechanisms in place to also represent intermediate results as
first-class citizens, to allow sharing across applications.

Intermediate results are in Flink simply streams, which may be fully
buffered/cached for cases where one wants to consume them multiple times or
share them.

The code for that is not fully merged, yet.

I hope that helped shed some light.


On Thu, Aug 6, 2015 at 9:26 PM, Chenliang (Liang, DataSight) <
chenliang613@huawei.com> wrote:

> Hi
> Flink and Spark take different approaches to do computation, each of which
> has its pros and cons.
> Who can elaborate the pros and cons for “Operator-centric, intermediate
> data-centric”? Any help would be very appreciated.
> First of all, share my understanding:
> *Operator-centric:* Could have more performance for steaming scenarios,
> don’t need to wait data together for computation.
> *Intermediate data-centirc:* Easy to do integration across different
> system to share intermediate data, especially for data mining and machine
> learning, need to reuse intermediate data for multiple iteration
> computation,the approach would be more efficient.
> Regards
> Liang
> Huawei Technologies Co., Ltd.

View raw message