spark-user mailing list archives

From Andy Davidson <>
Subject newbie system architecture problem, trouble using streaming and RDD.pipe()
Date Tue, 30 Sep 2014 00:56:09 GMT

I am trying to build a system that performs a very simple calculation on a
stream and displays the results in a graph, which I want to update every
second or so. I think I have a fundamental misunderstanding of how streams
and rdd.pipe() work. I want to do the data visualization part in an IPython
notebook: it makes it really easy to graph, animate, and share the page. I
understand streaming does not work in Python yet. Googling around, it
appears you can use RDD.pipe() to get the streaming data into Python.
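As I understand it, rdd.pipe() runs an external command, feeds the RDD's elements to its stdin one per line, and collects its stdout lines as a new RDD. A minimal sketch of the script side of such a pipe, assuming a line-oriented log format; the script name, the ERROR level, and count_matching are my own invention, not anything from Spark:

```python
#!/usr/bin/env python
# Hypothetical script invoked as rdd.pipe("python count_errors.py").
# Spark writes each element of the partition to stdin, one per line,
# and the lines this script prints become elements of the piped RDD.
import sys

def count_matching(lines, level="ERROR"):
    """Count the input lines containing the given log level (assumed format)."""
    return sum(1 for line in lines if level in line)

if __name__ == "__main__" and not sys.stdin.isatty():
    # Emit a single line: the count for this partition's input.
    print(count_matching(sys.stdin))
```

Each worker runs one copy of the script per partition, so the notebook still has to combine the per-partition outputs.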

I have a little IPython notebook I have been experimenting with. It uses
rdd.pipe() to run the following Java job. The pseudocode is:

main() {
   JavaStreamingContext ssc = new JavaStreamingContext(jsc, new Duration(1000)); // 1-second batches

   JavaDStream<String> logs = createStream(ssc);
   JavaDStream<String> msg = logs.filter(selectMsgLevel);
   JavaDStream<Long> count = msg.count();

   ssc.start();
   ssc.awaitTermination();
}

I do not understand how I can pass any data back to Python. If my
understanding is correct, everything runs on a worker; there is no way for
the driver to get the value of the mini-batch count and write it to
standard out.

The Streaming documentation on 'design patterns for using foreachRDD'
demonstrates how the slaves/workers can send data to other systems. So does
the overall architecture of my little demo need to be something like:

Process a) the IPython notebook, which calls rdd.pipe()

Process b) basically some little daemon process that the workers can
connect to. It just writes whatever it receives to standard out.

Process c) the Java Spark Streaming code

Do I really need the "daemon in the middle"?
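In case it helps make the question concrete, here is a minimal sketch of what I mean by process b): a daemon the workers connect to over TCP, which relays each received line to standard out for the notebook to read. The line-per-message protocol, host, port, and names are all my assumptions:

```python
import socketserver
import sys

class EchoHandler(socketserver.StreamRequestHandler):
    """Relay every line a worker sends straight to standard out."""
    def handle(self):
        for line in self.rfile:
            sys.stdout.write(line.decode("utf-8"))
            sys.stdout.flush()

def serve(host="localhost", port=9999):
    # Accept worker connections forever; call serve() to start the daemon.
    with socketserver.TCPServer((host, port), EchoHandler) as server:
        server.serve_forever()
```

The Java job's foreachRDD closure would then open a socket to this daemon and write each mini-batch count as one line.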

Any insights would be greatly appreciated


P.S. I assume that if I wanted to use aggregates I would need to use
JavaDStream.wrapRDD(). Is this correct? Is it expensive?
