spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Streaming: which code is (not) executed at every batch interval?
Date Tue, 04 Nov 2014 20:44:30 GMT
Maybe you are looking for updateStateByKey?
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
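
For what it's worth, here is a minimal sketch of what updateStateByKey looks like, written against the PySpark streaming API. The update function is called for each key at every batch interval, with the values that arrived in that batch and the state carried over from the previous one. The names update_count and build_running_counts, the socket source, the host/port, and the "checkpoint" directory are all placeholders for illustration, not anything from this thread.

```python
def update_count(new_values, running_count):
    """Update function for updateStateByKey.

    new_values: list of values seen for this key in the current batch.
    running_count: state from the previous batch, or None the first time.
    """
    return sum(new_values) + (running_count or 0)


def build_running_counts(ssc, host="localhost", port=9999):
    """Wire the update function into a word-count stream (assumed source)."""
    # updateStateByKey needs a checkpoint directory to persist state.
    ssc.checkpoint("checkpoint")
    lines = ssc.socketTextStream(host, port)
    pairs = lines.flatMap(lambda line: line.split()).map(lambda w: (w, 1))
    # The result is a DStream of (word, running total) pairs.
    return pairs.updateStateByKey(update_count)
```

Only the code inside update_count (and the lambdas) runs per batch; the body of build_running_counts runs once, before ssc.start(), and just describes the pipeline.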

You can use a broadcast variable to efficiently send data to all the
workers if you have other immutable data, such as the contents of a
local file, that needs to be distributed.
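
As a rough sketch of how the two ideas can combine (PySpark again): a function passed to transform() is invoked on the driver at every batch interval, so it is one place where a local file can be re-read and re-broadcast per batch. Everything named here is hypothetical: load_lookup, tag, attach_lookup, and the "key value" file format are assumptions made for the example.

```python
def load_lookup(path):
    """Read 'key value' lines from a local file into a dict (driver side)."""
    table = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:  # skip blank or malformed lines
                table[parts[0]] = parts[1]
    return table


def tag(record, table):
    """Pair a record with its looked-up value, or None if absent."""
    return (record, table.get(record))


def attach_lookup(dstream, path):
    """Return a DStream whose records are tagged from the file at `path`."""
    def per_batch(rdd):
        # Runs on the driver once per batch: re-read the file and
        # broadcast a fresh, immutable snapshot to the workers.
        snapshot = rdd.context.broadcast(load_lookup(path))
        # The lambda runs on the workers and only reads the snapshot.
        return rdd.map(lambda record: tag(record, snapshot.value))
    return dstream.transform(per_batch)
```

Each broadcast is still immutable; what changes is that a new one is created per batch. That is only reasonable for a small file, since this sketch makes no attempt to release old broadcasts or to skip the reload when the file is unchanged.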

On Tue, Nov 4, 2014 at 8:38 PM, Steve Reinhardt <spr@yarcdata.com> wrote:
>
> -----Original Message-----
> From: Sean Owen <sowen@cloudera.com>
>
>>On Tue, Nov 4, 2014 at 8:02 PM, spr <spr@yarcdata.com> wrote:
>>> To state this another way, it seems like there's no way to straddle the
>>> streaming world and the non-streaming world;  to get input from both a
>>> (vanilla, Linux) file and a stream.  Is that true?
>>>
>>> If so, it seems I need to turn my (vanilla file) data into a second
>>> stream.
>>
>>Hm, why do you say that? Nothing prevents that at all. You can do
>>anything you like in your local code, or in functions you send to
>>remote workers. (Of course, if those functions depend on a local file,
>>it has to exist locally on the workers.) You do have to think about
>>the distributed model here, but what executes locally/remotely isn't
>>mysterious. It is the functions passed to Spark API methods that will be
>>executed remotely.
>
> The distinction I was calling out was temporal, not local/distributed,
> though that is another important dimension.  It sounds like I can do
> anything I want in the code before the ssc.start(), but that code runs
> once at the beginning of the program.  What I'm searching for is some way
> to have code that runs repeatedly and potentially updates a variable that
> the Streaming code will see.  Broadcast() almost does that, but apparently
> the underlying variable should be immutable.  I'm not aware of any (Spark)
> way to have code run repeatedly other than as part of the Spark Streaming
> API, but that doesn't look at vanilla files.
>
> The distributed angle you raise makes my "vanilla file" approach not quite
> credible, in that the vanilla file would have to be distributed to all the
> nodes for the updates to be seen.  So maybe the simplest way to do that is
> to have a vanilla Linux process monitor the vanilla file (on a client
> node) and send any changes to it into a (distinct) stream.  If so, the
> remote code would need to monitor both that stream and the main data
> stream.  Does that make sense?
>
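
One possible shape for the two-stream idea, sketched in PySpark under several assumptions: the changes arrive as new files dropped into a directory that textFileStream monitors, each line is "key value", the main data comes from a socket, and records are keyed by their first token. The names parse_kv, latest_value, key_of, and build_joined_stream are made up for the example.

```python
def parse_kv(line):
    """Split a 'key value' control line into a (key, value) pair."""
    key, value = line.split(None, 1)
    return key, value.strip()


def latest_value(new_values, current):
    """State update: the most recent control value for a key wins."""
    return new_values[-1] if new_values else current


def key_of(record):
    """Key a data record by its first whitespace-separated token."""
    parts = record.split(None, 1)
    return parts[0] if parts else ""


def build_joined_stream(ssc, control_dir, host="localhost", port=9999):
    """Join the main data stream against state built from a control stream."""
    ssc.checkpoint("checkpoint")  # required by updateStateByKey
    # Control stream: new files appearing in control_dir carry the updates.
    control = ssc.textFileStream(control_dir).map(parse_kv)
    # Every batch, this emits the current (key, value) state for all keys.
    settings = control.updateStateByKey(latest_value)
    # Main stream, keyed so it can be joined with the settings.
    data = ssc.socketTextStream(host, port).map(lambda r: (key_of(r), r))
    # Per batch: (key, (record, value or None)).
    return data.leftOuterJoin(settings)
```

With this arrangement the process watching the original file only has to write a new file into control_dir when something changes; the streaming job never reads the vanilla file directly.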
