spark-user mailing list archives

From Dimitris Kouzis - Loukas <look...@gmail.com>
Subject Streaming and calculated-once semantics
Date Wed, 05 Aug 2015 17:46:58 GMT
Hello, here's a simple program that demonstrates my problem:


from pyspark.streaming import StreamingContext

# sc is an existing SparkContext; 1-second batch interval
ssc = StreamingContext(sc, 1)

# two micro-batches of (key, value) pairs fed in through a queue stream
input = [[("k1", 12), ("k2", 14)], [("k1", 22)]]

rawData = ssc.queueStream([sc.parallelize(d, 1) for d in input])

# keep a running sum per key across batches
runningRawData = rawData.updateStateByKey(lambda nv, prev: sum(nv) + (prev or 0))

def stats(rdd):
    # average over all values in the batch, then subtract it from each value
    keyavg = rdd.values().reduce(lambda a, b: a + b) / rdd.count()
    return rdd.mapValues(lambda i: i - keyavg)

runningRawData.transform(stats).pprint()
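
If I'm reading this right, the running state after the two batches should be {k1: 12, k2: 14} and then {k1: 34, k2: 14}, so the batch averages are 13 and 24, and the printed pairs should come out as (k1, -1), (k2, 1) for the first batch and (k1, 10), (k2, -10) for the second.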


I have a feeling this will calculate "keyavg = rdd.values().reduce(lambda a, b: a + b) /
rdd.count()" inside stats quite a few times, depending on the number of
partitions of the current RDD.

What would be an alternative way to do this two-step computation without
calculating the average many times?
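
One idea I've been toying with (just a sketch, and I'm not sure it actually guarantees a single evaluation) is to cache the per-batch RDD inside stats and compute the average with a single mean() job instead of separate reduce() and count() jobs:

def stats(rdd):
    # sketch: persist the batch RDD so the mean job and the later mapValues
    # pass reuse the same materialized data instead of recomputing the lineage
    rdd.cache()
    avg = rdd.values().mean()  # single pass over the values
    return rdd.mapValues(lambda i: i - avg)

but I don't know whether that is the idiomatic way to express this in Spark Streaming.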
