spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From andy petrella <>
Subject [brainsotrming] Generalization of DStream, a ContinuousRDD ?
Date Tue, 15 Jul 2014 17:11:01 GMT
Dear Sparkers,

*[sorry for the lengthy email... => head to the gist
<> for a preview

I would like to share some thinking I had due to a use case I faced.
Basically, as the subject announced it, it's a generalization of the
DStream currently available in the streaming project.
First of all, I'd like to say that it's only a result of some personal
thinking, alone in the dark with a use case, the spark code, a sheet of
paper and a poor pen.

DStream is a very great concept to deal with micro-batching use cases, and
it does it very well too!
Also, it hardly relies on the elapsing time to create its internal
However, there are similar use cases where we need micro-batches where this
constraint on the time doesn't hold, here are two of them:
* a micro-batch has to be created every *n* events received
* a micro-batch has to be generate based on the values of the items pushed
by the source (which might even not be a stream!).

An example of use case (mine ^^) would be
* the creation of timeseries from a cold source containing timestamped
events (like S3).
* one these timeseries have cells being the mean (sum, count, ...) of one
of the fields of the event
* the mean has to be computed over a window depending on a field *timestamp*.

* a timeserie is created for each type of event (the number of types is
So, in this case, it'd be interesting to have an RDD for each cell, which
will generate all cells for all neede timeseries.
It's more or less what DStream does, but here it won't help due what was
stated above.

That's how I came to a raw sketch of what could be named ContinuousRDD
(CRDD) which is basically and RDD[RDD[_]]. And, for the sake of simplicity
I've stuck with the definition of a DStream to think about it. Okay, let's
go ^^.

Looking at the DStream contract, here is something that could be drafted
around CRDD.
A *CRDD* would be a generalized concept that relies on:
* a reference space/continuum (to which data can be bound)
* a binning function that can breaks the continuum into splits.
Since *Space* is a continuum we could define it as:
* a *SpacePoint* (the origin)
* a SpacePoint=>SpacePoint (the continuous function)
* a Ordering[SpacePoint]

DStream uses a *JobGenerator* along with a DStreamGraph, which are using
timer and clock to do their work, in the case of a CRDD we'll have to
define also a point generator, as a more generic but also adaptable

So far (so good?), these definition should work quite fine for *ordered* space
for which:
* points are coming/fetched in order
* the space is fully filled (no gaps)
For these cases, the JobGenerator (f.i.) could be defined with two extra
* one is responsible to chop the batches even if the upper bound of the
batch hasn't been seen yet
* the other is responsible to handle outliers (and could wrap them into yet
another CRDD ?)

I created a gist here wrapping up the types and thus the skeleton of this
idea, you can find it here:

*The answer can be: you're a fool!*
Actually, I already I am, but also I like to know why.... so some
explanations will help me :-D.

Thanks to read 'till this point.


 aℕdy ℙetrella
[image: aℕdy ℙetrella on]


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message