spark-dev mailing list archives

From François Garillot <francois.garil...@typesafe.com>
Subject Re: Spark Streaming Data flow graph
Date Tue, 06 Jan 2015 15:14:08 GMT
Thanks a lot for your answer! I've updated the diagram, at the same
address:
https://www.dropbox.com/s/q79taoce2ywdmf1/SparkStreaming.pdf?dl=0

I've addressed your more straightforward remarks directly in the diagram. A
couple of questions:

- The location of instances (Executor, Master, Driver) is now marked. I
hope I didn't make too many mistakes there; did I?

- Given that the communication between instances and their members (e.g.
ReceiverSupervisor / ReceivedBlockHandler) is deliberately omitted, have I
forgotten any communication channels?

- I've represented some queues / buffers using a red trapezoid. I'm
starting an inventory of queues and buffers, and I'm interested in adding
the 'implicit' ones as well (e.g. jobSets in JobScheduler, which is indexed
by time in ms). I'd be happy with pointers on where to look: ideally, I'm
trying to find every place in the data flow where data sits idle for any
length of time, waiting to be chunked somehow (whether it's at the RDD or
block level doesn't really matter to me; I'm interested in all types of
'chunking').
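
To make concrete what I mean by an 'implicit' buffer, here is a toy sketch
in Python. It is not Spark code: the class name TimeIndexedBuffer and its
methods are hypothetical, invented for illustration. It only assumes a
structure keyed by batch time in ms, as jobSets is described above, and
records how long each entry sat idle before being taken out.

```python
class TimeIndexedBuffer:
    """Toy stand-in for a buffer keyed by batch time in ms.

    Hypothetical illustration only; this is not how Spark implements jobSets.
    """

    def __init__(self):
        # batch_time_ms -> (enqueue timestamp in ms, buffered items)
        self._pending = {}

    def put(self, batch_time_ms, items, now_ms):
        """Store items for a batch, remembering when they were enqueued."""
        self._pending[batch_time_ms] = (now_ms, list(items))

    def take(self, batch_time_ms, now_ms):
        """Remove a batch and return (items, idle time in ms)."""
        enqueued_ms, items = self._pending.pop(batch_time_ms)
        return items, now_ms - enqueued_ms

    def pending_times(self):
        """Batch times still waiting in the buffer, oldest first."""
        return sorted(self._pending)
```

For each trapezoid in the diagram, the quantity I'd like to be able to name
is the idle time returned by take(): the gap between data entering a
buffer and leaving it.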

Naturally, this is intended to be a developer document exclusively (which
is why I'm not publicising it on the user ML).


On Mon, Jan 5, 2015 at 10:57 PM, Tathagata Das <tathagata.das1565@gmail.com>
wrote:

> Hey François,
>
> Well, at a high-level here is what I thought about the diagram.
>
> - ReceiverSupervisor handles only one Receiver.
> - BlockGenerator is part of ReceiverSupervisor not ReceivedBlockHandler
> - The blocks are inserted into BlockManager and, if it is activated,
> into WriteAheadLogManager in parallel, not through BlockManager as the
> diagram seems to imply
> - It would be good to have a clean visual separation of what runs in
> Executor (better term than Worker) and what is in Driver ... Driver
> stuff on left and Executor stuff on right, or vice versa.
>
> More importantly, a word of caution: all the internal stuff
> like ReceivedBlockHandler, ReceiverSupervisor, etc. is subject to change
> at any time as we keep refactoring. So highlighting these internal
> details too much and too publicly may lead to future confusion.
>
> TD
>
> On Thu, Dec 18, 2014 at 11:04 AM,  <francois.garillot@typesafe.com> wrote:
> > I’ve been trying to produce an updated box diagram to refresh:
> >
> > http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617/26
> >
> >
> > … after the SPARK-3129, and other switches (a surprising number of
> > comments still mention NetworkReceiver).
> >
> >
> > Here’s what I have so far:
> > https://www.dropbox.com/s/q79taoce2ywdmf1/SparkStreaming.pdf?dl=0
> >
> >
> > This is not supposed to respect any particular convention (ER, ORM, …).
> > Data flow up to right before RDD creation is in bold arrows, metadata flow
> > is in normal width arrows.
> >
> >
> > This diagram is still very much a WIP (see the todo below), but I wanted
> > to share it to ask:
> > - what’s wrong?
> > - what are the glaring omissions?
> > - how can I make this better (i.e. what should I add first to the
> > Todo-list below)?
> >
> >
> > I’ll be happy to share this (including sources) with whoever asks for it.
> >
> >
> > Todo:
> > - mark private/public classes
> > - mark queues in Receiver, ReceivedBlockHandler, BlockManager
> > - mark type of info on transport : e.g. Actor message, ReceivedBlockInfo
> >
> >
> >
> > —
> > François Garillot
>



-- 
François Garillot
