spark-user mailing list archives

From Tobias Pfeiffer <...@preferred.jp>
Subject Closing over a var with changing value in Streaming application
Date Wed, 21 Jan 2015 07:23:50 GMT
Hi,

I am developing a Spark Streaming application where I want every item in my
stream to be assigned a unique, strictly increasing Long. My input data
already has RDD-local integers (from 0 to N-1) assigned, so I am doing the
following:

  var totalNumberOfItems = 0L
  // update the keys of the stream data
  val globallyIndexedItems = inputStream.map(keyVal =>
      (keyVal._1 + totalNumberOfItems, keyVal._2))
  // increase the number of total seen items
  inputStream.foreachRDD(rdd => {
    totalNumberOfItems += rdd.count
  })
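To make my intent concrete, here is a plain-Scala sketch of the numbering I am after (no Spark involved, the batch contents are made up): the offset for batch n is the total number of items seen in batches 0 to n-1.

  // plain Scala, no Spark; two made-up batches with RDD-local ids
  val batches = Seq(
    Seq((0L, "a"), (1L, "b"), (2L, "c")),  // batch 0: local ids 0..2
    Seq((0L, "d"), (1L, "e"))              // batch 1: local ids 0..1
  )
  var total = 0L
  for (batch <- batches) {
    // shift the local ids by the number of items seen so far
    println(batch.map { case (i, v) => (i + total, v) })  // keys 0..2, then 3..4
    total += batch.size
  }

So as long as the map in each interval sees the count accumulated over all previous intervals, the keys come out globally unique and strictly increasing.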

Now this works on my local[*] Spark instance, but I was wondering if this
is actually an ok thing to do. I don't want this to break when going to a
YARN cluster...

The function that increases totalNumberOfItems closes over a var and runs
in the driver, so I think that part is OK. Here is my concern: what about
the function in the inputStream.map(...) block? It closes over a var whose
value changes in every interval. Will the closure be serialized with the
new value in every interval, or only once with the initial value, so that
totalNumberOfItems is effectively always 0 for the lifetime of the program?

As I said, it works locally, but I was wondering if I can really assume
that the closure is serialized with a new value in every interval.
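To convince myself, I suppose I could log the value that is actually
captured on the executors and compare it per interval, something like the
following sketch (building on the code above; on a cluster the println
output would end up in the executor logs rather than on the driver console):

  val debugStream = inputStream.map { keyVal =>
    // runs on the executors: which totalNumberOfItems was shipped with this closure?
    println(s"captured totalNumberOfItems = $totalNumberOfItems")
    (keyVal._1 + totalNumberOfItems, keyVal._2)
  }
  debugStream.print()  // output operation, so the map actually runs every interval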

Thanks,
Tobias
