spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From qihong <>
Subject how to setup steady state stream partitions
Date Wed, 10 Sep 2014 03:03:08 GMT
I'm working on a DStream application.  The input are sensors' measurements, 
the data format is <sensor id><timestamp><measure>

There are 10 thousands sensors, and updateStateByKey is used to maintain
the states of sensors, the code looks like following:

val inputDStream = ...
val keyedDStream =  // use sensorId as key
val stateDStream = keyedDStream.updateStateByKey[...](udpateFunction)

Here's the question:
In a cluster with 10 worker nodes, is it possible to partition the input
dstream, so that node 1 handles sendor 0-999, node 2 handles 1000-1999,
and so on?

Also, is it possible to keep state stream for sensor 0 - 999 on node 1, 1000
to 1999 on node 2, and etc. Right now, I see sensor state stream is shuffled
for every batch, which used lot of network bandwidth and it's unnecessary.

Any suggestions?


View this message in context:
Sent from the Apache Spark User List mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message