spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Soumitra Kumar <>
Subject Re: How to initialize updateStateByKey operation
Date Tue, 23 Sep 2014 16:41:49 GMT
I thought I did a good job ;-)

OK, so what is the best way to initialize updateStateByKey operation? I have counts from previous
spark-submit, and want to load that in next spark-submit job.

----- Original Message -----
From: "Soumitra Kumar" <>
To: "spark users" <>
Sent: Sunday, September 21, 2014 10:43:01 AM
Subject: How to initialize updateStateByKey operation

I started with StatefulNetworkWordCount to have a running count of words seen.

I have a file 'stored.count' which contains the word counts.

$ cat stored.count
a 1
b 2

I want to initialize StatefulNetworkWordCount with the values in 'stored.count' file, how
do I do that?

I looked at the paper 'EECS-2012-259.pdf' Matei et al, in figure 2, it would be useful to
have an initial RDD feeding into 'counts' at 't = 1', as below.

t = 1: pageView -> ones -> counts
t = 2: pageView -> ones -> counts

I have also attached the modified figure 2 of

I managed to hack Spark code to achieve this, and attaching the modified files.

Essentially, I added an argument 'initial : RDD [(K, S)]' to updateStateByKey method, as
    def updateStateByKey[S: ClassTag](
        initial : RDD [(K, S)],
        updateFunc: (Seq[V], Option[S]) => Option[S],
        partitioner: Partitioner
      ): DStream[(K, S)]

If it sounds interesting for larger crowd I would be happy to cleanup the code, and volunteer
to push into the code. I don't know the procedure to that though.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message