spark-user mailing list archives

From Paul Snively <>
Subject Re: Roadblock with Spark 0.8.0 ActorStream
Date Fri, 04 Oct 2013 15:27:25 GMT
Hi Prashant!

That's very interesting. So this appears to be a general issue with ActorStream, then?

With all DStreams, I suppose.
OK, so that seems like a bug independent of the issue around having no ActorSystem in scope
for runJob(), then.

I have not spent enough time on spray-io yet, but if it is implemented as an Actor extension,
then using actorStream is the best fit. For example, I have used akka-camel/akka-zeromq to
receive streams from a variety of endpoints. I am not suggesting you should use it, although
it has SSL support, etc.; that was the original motivation while designing it. But if it can
somehow be used in a different and more useful way, I'd encourage it.

 I'm reluctant to add another abstraction layer here, since the motivation for using spray-io
in the first place is its extreme scalability, and we mostly want to rely on Spark Streaming
here for managing availability and recovery. To me, it seems like we've identified two obvious
issues: one revolving around foreach ... foreach ... over a DStream not working correctly,
the other revolving around runJob() not having an ActorSystem in scope. While I understand
that there is much other work to do, both of these seem, first of all, rather serious, and
secondly, as if they should have relatively easy fixes. Of course, I'm open to being corrected
in this belief. :-)
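
For concreteness, here is a minimal sketch of what I mean by the foreach ... foreach ... pattern, assuming the 0.8.0 DStream API; `stream` and `process` are illustrative placeholders, not our actual code:

```scala
// Sketch only: `stream` is a DStream of byte arrays and `process` stands in
// for whatever per-record work we do.
stream.foreach { rdd =>            // executes on the driver, once per batch
  rdd.foreach { record =>          // executes on the workers via runJob()
    // The closure shipped to the workers must be serializable, and once
    // it arrives there, no ActorSystem is in scope.
    process(record)
  }
}
```

This is exactly the shape of code that, as far as we can tell, both fails to iterate correctly and trips over the missing ActorSystem.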

I'd like to understand a bit further what your ultimate intentions are, in a bit more detail,
on the architecture from a higher level. Probably I am hinting at a redesign.
That's certainly a valid option!

Without discussing things I'm not at liberty to, all I can really say is that I'm trying to
build a real-time streaming server that takes a connection from a client that expects to be
able to then send arbitrary bytes to the server. On the server, I want a DStream per socket
connection to process the incoming data. There's other stuff, such as this needing to support
TLS mutual authentication, but I'll get there when I have data flowing. :-)

It's really quite simple, or at least it should be, and again, I believe it will be, when
the bugs we've identified are fixed.

In your case, the code is still complicated enough that it is hard to spot where exactly we
are leaking an ActorRef. Is it in the test case?
If it helps, the only ActorRef in my code now is in the SpraySslClient, i.e. nowhere that's
being serialized.

Or in our Handler?
I have to respectfully suggest it is indeed this. OTOH, is it really a "leak?" I am, in fact,
closing over objects that, sooner or later, construct ActorRefs. The whole point of my task
is to do something with data coming from an ActorRef that represents a TCP connection, then
send the result back to that same ActorRef. I guess I'm having a little trouble understanding
how many actorStream use cases there are that wouldn't involve ultimately closing over at
least one ActorRef. :-)
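
To make that concrete, here is a hedged sketch of the shape of what I'm doing; every name below is an illustrative stand-in, not my actual code, and I'm assuming the 0.8.0-era package layout:

```scala
import akka.actor.ActorRef
import org.apache.spark.streaming.DStream

// Illustrative only: reply to the connection's ActorRef with a result
// computed from the stream. The inner closure captures `connection`, so
// Spark must serialize that ActorRef to ship the closure to the workers,
// and deserializing an ActorRef requires an ActorSystem in scope -- which
// runJob() does not currently provide.
def handle(connection: ActorRef, stream: DStream[Array[Byte]]) {
  stream.foreach { rdd =>
    rdd.foreach { bytes =>
      connection ! bytes.length    // stand-in for the real computation
    }
  }
}
```

So the "leak," if that's the word, is the whole point of the exercise: the result has to go back to the ActorRef representing the connection.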

It is also possible to start multiple actorStream receivers in advance and then use their
supervisor to start more actors as receivers for you, for example one per connection (if that
makes any sense, take a look at it).
I'm afraid I didn't quite follow that, but it sounds interesting!
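
If I follow the suggestion, it might look something like this sketch; all of the names here are illustrative, none come from Spark itself, and I'm assuming the 0.8.0 actorStream signature taking an Akka Props:

```scala
import akka.actor.{Actor, ActorRef, Props}

// Illustrative message type -- not part of any Spark or Akka API.
case class NewConnection(conn: ActorRef)

// A child actor per incoming connection; in a real receiver this would
// push the connection's bytes into the stream via the receiver machinery.
class ConnectionWorker(conn: ActorRef) extends Actor {
  def receive = {
    case bytes: Array[Byte] => // forward into the stream here
  }
}

// One receiver actor, started via ssc.actorStream, acting as a supervisor
// that spawns a worker per connection:
class ConnectionSupervisor extends Actor /* + the streaming Receiver trait */ {
  def receive = {
    case NewConnection(conn) =>
      context.actorOf(Props(new ConnectionWorker(conn)))
  }
}

// val stream = ssc.actorStream[Array[Byte]](Props(new ConnectionSupervisor), "supervisor")
```

Is that roughly the arrangement you had in mind?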

At this point, though, I have to suggest that the path forward, simply for Spark Streaming
to work correctly out of the box, is clear. I have created an issue against the missing ActorSystem
in runJob() in JIRA, but it sounds like we need another one for the foreach ... foreach ...
issue. I'm afraid I don't know exactly how to characterize that issue. Could I possibly impose
upon you to create that ticket?

Thanks again for digging into this!
