spark-user mailing list archives

From: Du Li
Subject: how to use rdd.countApprox
Date: Wed, 06 May 2015 14:53:53 GMT
I have to count RDDs in a Spark Streaming app. When the data grows large, count() becomes expensive.
Does anybody have experience using countApprox()? How accurate and reliable is it?
The documentation is pretty sparse. I assume the timeout parameter is in milliseconds. Can
I retrieve the count by calling getFinalValue()? Does it block and return only after
the timeout? Or do I need to define onComplete/onFail handlers to extract the count from
the partial results?
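
To make the question concrete, here is a minimal sketch of the blocking pattern described above. It assumes the Spark 1.x Scala API, where countApprox(timeout, confidence) returns a PartialResult[BoundedDouble] and the timeout is given in milliseconds. The local master, the app name, and the sample RDD are illustrative placeholders, not taken from a real streaming job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.partial.{BoundedDouble, PartialResult}

object CountApproxSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder local setup; a streaming app would take its RDDs from a DStream instead.
    val conf = new SparkConf().setAppName("countApprox-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

      // Returns once the job finishes or the timeout (in milliseconds) elapses,
      // whichever comes first.
      val partial: PartialResult[BoundedDouble] =
        rdd.countApprox(timeout = 1000L, confidence = 0.95)

      // The estimate available at return time, with its confidence bounds.
      val initial = partial.initialValue
      println(s"estimate=${initial.mean} low=${initial.low} high=${initial.high}")
      println(s"already exact: ${partial.isInitialValueFinal}")

      // Blocks until the underlying job has completed, then yields the final value.
      val finalCount = partial.getFinalValue()
      println(s"final count=${finalCount.mean}")
    } finally {
      sc.stop()
    }
  }
}
```

In this sketch initialValue is what the timeout buys, while getFinalValue() waits for the whole job regardless of the timeout, so code that only needs the estimate would read initialValue and skip the blocking call.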