spark-user mailing list archives

From: Steve Reinhardt <...@yarcdata.com>
Subject: Re: Streaming: which code is (not) executed at every batch interval?
Date: Tue, 04 Nov 2014 20:59:24 GMT
From: Sean Owen <sowen@cloudera.com>

>Maybe you are looking for updateStateByKey?
>http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
>
>You can use broadcast to efficiently send info to all the workers, if
>you have some other data that's immutable, like in a local file, that
>needs to be distributed.
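
For concreteness, I read updateStateByKey as roughly the sketch below
(the socket source, 10-second batches, and per-word counts are
placeholder assumptions on my part, not my actual job):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object RunningCounts {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("RunningCounts")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint("/tmp/streaming-checkpoint") // updateStateByKey requires a checkpoint dir

        val words = ssc.socketTextStream("localhost", 9999)

        // Fold each batch's new values into the running state per key.
        val counts = words.map(w => (w, 1)).updateStateByKey[Int] {
          (newValues: Seq[Int], state: Option[Int]) =>
            Some(newValues.sum + state.getOrElse(0))
        }

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }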

The second flow of data comes from a human, so it updates rarely by
streaming standards.  I'm agnostic about how to incorporate this second
flow; it just needs to work reasonably somehow.

My first approach was to put it in a file, monitor changes to that file
(in a Linux, non-Spark way), and then disseminate it to all the nodes
somehow.
- Since the data is mutable, at first blush broadcast() seems a poor
match.  Or is there some way for the result of the broadcast to be a new
variable each time, in the streaming code?  (A sketch of what I mean
follows this list.)
- Is there a way for the (Linux, non-Spark) code to read the file and
then write it into a socket (say) stream that is sent redundantly, not
partitioned, to all the nodes?  (I.e., it is a broadcast in that sense.)
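
To make the first bullet concrete, here is roughly what I mean by "a
new variable each time."  The holder object, the file path, and the
Set[String] payload are all made up for illustration:

    import java.nio.file.{Files, Paths}
    import scala.io.Source
    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Hypothetical driver-side holder: re-read the file and broadcast
    // a fresh value whenever the file's modification time changes.
    object SideInput {
      @volatile private var lastModified = 0L
      @volatile private var current: Broadcast[Set[String]] = null

      def get(sc: SparkContext, path: String): Broadcast[Set[String]] =
        synchronized {
          val mtime = Files.getLastModifiedTime(Paths.get(path)).toMillis
          if (current == null || mtime > lastModified) {
            if (current != null) current.unpersist() // drop the stale copy
            current = sc.broadcast(Source.fromFile(path).getLines().toSet)
            lastModified = mtime
          }
          current
        }
    }

    // The function passed to transform() runs on the driver once per
    // batch, so the broadcast could be refreshed at every interval:
    //   stream.transform { rdd =>
    //     val side = SideInput.get(rdd.sparkContext, "/path/to/side-input.txt")
    //     rdd.filter(x => side.value.contains(x))
    //   }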

I hope my description makes sense.

