spark-user mailing list archives

From Ji ZHANG <>
Subject Implement Count by Minute in Spark Streaming
Date Sun, 26 Oct 2014 11:03:51 GMT

Suppose I have a stream of logs and I want to count them by minute.
The result should look like this:

2014-10-26 18:38:00 100
2014-10-26 18:39:00 150
2014-10-26 18:40:00 200
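
To make the target output concrete, here is a minimal sketch in plain Python (no Spark) of the bucketing step: each log timestamp is truncated to the start of its minute and the occurrences are counted. The names minute_key and count_by_minute, the timestamp format, and the sample data are assumptions made for illustration only.

```python
from collections import Counter
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed log timestamp format


def minute_key(ts: str) -> str:
    """Truncate a timestamp string to the start of its minute."""
    parsed = datetime.strptime(ts, TS_FORMAT)
    return parsed.replace(second=0).strftime(TS_FORMAT)


def count_by_minute(timestamps):
    """Count how many timestamps fall into each minute bucket."""
    return Counter(minute_key(ts) for ts in timestamps)


# Hypothetical sample of log timestamps.
sample = ["2014-10-26 18:38:05", "2014-10-26 18:38:59", "2014-10-26 18:39:10"]
counts = count_by_minute(sample)
for minute in sorted(counts):
    print(minute, counts[minute])
```

In a streaming job the same key function would be applied per record before aggregating, so the key is derived from the log's own timestamp rather than from the batch time.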

One way to do this is to set the batch interval to 1 min, but each
batch would be quite large.

Alternatively, I can use updateStateByKey with a key like
'2014-10-26 18:38:00', but then I have two questions:

1. How do I persist the result to MySQL? Do I need to flush it every batch?
2. How do I delete the old state? For example, it is now 18:50 but the
state for 18:40 is still in Spark. One solution is to set a key's
state to None when the batch contains no data for that key. But what
if log volume is low and some batches contain no logs at all? For
example:

18:40:00~18:40:10 has 10 logs -> key 18:40's value is set to 10
18:40:10~18:40:20 has no logs -> key 18:40 is deleted
18:40:20~18:40:30 has 5 logs -> key 18:40's value is set to 5

As you can see, the result is wrong: the final count is 5 instead of
15. Maybe I can use an 'update' approach when flushing, i.e. check
whether MySQL already has an entry for 18:40 and add the batch result
to it. But what about a unique count? I can't store every unique
value in MySQL.
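
One possible way around the deletion problem in question 2 is to drop a key only after it has been idle for several consecutive batches, instead of on the first empty batch. The sketch below is plain Python with the same (new values, previous state) -> new state shape that an updateStateByKey update function takes, where returning None removes the key. The (count, idle_batches) state layout and the MAX_IDLE_BATCHES threshold are assumptions, not something the Spark API prescribes.

```python
# Hypothetical threshold: with 10-second batches, 6 empty batches is
# about one minute of silence for a key.
MAX_IDLE_BATCHES = 6


def update_count(new_values, state):
    """Update function sketch: state is (count, idle_batches) or None."""
    count, idle = state if state is not None else (0, 0)
    if new_values:
        # Data arrived for this key: add it and reset the idle counter.
        return (count + sum(new_values), 0)
    idle += 1
    if idle >= MAX_IDLE_BATCHES:
        # Idle long enough: returning None removes the key's state.
        return None
    return (count, idle)


# Replay the scenario above: 10 logs, an empty batch, then 5 logs.
state = None
for batch in ([1] * 10, [], [1] * 5):
    state = update_count(batch, state)
print(state)  # (15, 0)
```

With this shape the empty batch only increments the idle counter, so the 18:40 key ends at 15 rather than 5. The count would still need to be written out before the key is dropped.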

So I'm looking for a better way to store count-by-minute results in an
RDBMS (or NoSQL?). Any ideas would be appreciated.
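
For the plain count, the 'update' approach mentioned above can be sketched as an additive write: each batch adds its per-minute counts to whatever is already stored. The example uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs on its own; the table name minute_counts, its columns, and the function flush_counts are hypothetical.

```python
import sqlite3


def flush_counts(conn, batch_counts):
    """Add each batch's per-minute count to the stored total for that minute."""
    with conn:  # one transaction per flush
        for minute, n in batch_counts.items():
            cur = conn.execute(
                "UPDATE minute_counts SET cnt = cnt + ? WHERE minute = ?",
                (n, minute),
            )
            if cur.rowcount == 0:
                # No row for this minute yet, so create it.
                conn.execute(
                    "INSERT INTO minute_counts (minute, cnt) VALUES (?, ?)",
                    (minute, n),
                )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE minute_counts (minute TEXT PRIMARY KEY, cnt INTEGER NOT NULL)"
)
# Two batches that both contain logs for 18:40.
flush_counts(conn, {"2014-10-26 18:40:00": 10})
flush_counts(conn, {"2014-10-26 18:40:00": 5, "2014-10-26 18:41:00": 3})
rows = dict(conn.execute("SELECT minute, cnt FROM minute_counts"))
print(rows)
```

In MySQL the update-then-insert pair could be collapsed into a single INSERT ... ON DUPLICATE KEY UPDATE statement. Note that additive writes are not idempotent: if a batch is replayed after a failure, its counts are added twice. This sketch also does not address the unique-count case.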


