spark-user mailing list archives

From Adrian Tanase <atan...@adobe.com>
Subject Re: Spark StreamingStatefull information
Date Thu, 22 Oct 2015 10:01:14 GMT
The result of updateStateByKey is a DStream that emits the entire state every batch, as an
RDD; nothing special about it.

It is easy to join / cogroup it with another RDD if you have the correct keys in both.
You could load that one when the job starts and/or have it updated with updateStateByKey as
well, based on streaming updates from Cassandra.
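A minimal sketch of what I mean, assuming a per-member state that tracks geofence presence and message counts (all identifiers here are illustrative, not from your job). The update function is plain Scala; returning None drops the key from the state, which also covers the "pop things after a TTL" idea from your mail. The Spark wiring at the bottom is indicative only.

```scala
// Illustrative state for one member (names are assumptions, not your schema).
case class GeofenceState(inside: Boolean, messagesSent: Int, lastSeenMs: Long)

// Called once per key per batch by updateStateByKey.
// `events` holds this batch's "enter"/"exit" values for the key.
// Returning None removes the key from the state, e.g. after a TTL expires.
def updateState(ttlMs: Long, nowMs: Long)(
    events: Seq[String],
    state: Option[GeofenceState]): Option[GeofenceState] = {
  val prev = state.getOrElse(GeofenceState(inside = false, 0, nowMs))
  if (events.isEmpty) {
    // No new events for this key: keep the state only while it is fresh.
    if (nowMs - prev.lastSeenMs > ttlMs) None else Some(prev)
  } else {
    // Last event in the batch wins; count every event as a sent message.
    val nowInside = events.last == "enter"
    Some(GeofenceState(nowInside, prev.messagesSent + events.size, nowMs))
  }
}

// Indicative Spark wiring (needs a StreamingContext; `events` is a
// hypothetical DStream[(String, String)] of (member, state) pairs):
// val stateStream = events.updateStateByKey(
//   updateState(ttlMs = 60000L, System.currentTimeMillis()) _)
// Join the full state with a reference RDD loaded at startup:
// val enriched = stateStream.transform(rdd => rdd.join(referenceRdd))
```

The join via transform works because the state DStream's RDDs are ordinary pair RDDs keyed by member.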

Sent from my iPhone

> On 22 Oct 2015, at 12:54, Arttii <a.topchyan@reply.de> wrote:
> 
> Hi,
> 
> So I am working on a use case where clients are walking in and out of
> geofences and sending messages based on that.
> I currently have some in Memory Broadcast vars to do certain lookups for
> client and geofence info, some of this is also coming from Cassandra.
> My current quandry is that I need to support the case where a user comes in
> and out of geofence and also track how many messages have already been sent
> and do some logic based on that.
> 
> My stream is basically a bunch of JSONs:
> {
>   "member": "...",
>   "beacon": "...",
>   "state": "enter" or "exit"
> }
> 
> 
> This information is invalidated at certain timesteps, say messages once a day and
> geofences every few minutes. First I thought broadcast vars might be good for
> this, but this data gets updated often, so I do not think I can periodically
> rebroadcast it from the driver.
> 
> So I was thinking this might be a perfect case for updateStateByKey, as I can
> kind of track what is going on
> and also track the time inside the values and return Nones to "pop" things.
> 
> Currently I cannot wrap my head around how to use this stream in
> conjunction with some other info that is coming in as DStreams / RDDs. All the
> examples for updateStateByKey basically do something to a stream with
> updateStateByKey and then foreach over it, persisting into a store. I
> don't think writing to and reading from Cassandra on every batch to get this
> info is a good idea, because I might get stale info.
> 
> Is this a valid case, or am I missing the point and use case of this function?
> 
> Thanks,
> Artyom
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-StreamingStatefull-information-tp25160.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
> 


