spark-user mailing list archives

From Stefano Lodi <stefano.l...@unibo.it>
Subject countApprox
Date Thu, 15 Sep 2016 17:20:34 GMT
I am experimenting with countApprox. I created an RDD of 10^8 numbers and ran countApprox with
different parameters, but I could not get it to return an approximate result: every run
returned the exact number of elements. What effect is the approximation in countApprox
supposed to have, and for which inputs and parameters should it become visible?

>>> import random
>>> rdd = sc.parallelize([random.choice(range(1000)) for i in range(10**8)], 50)
>>> rdd.countApprox(1, 0.8)
[Stage 12:>                                                        (0 + 0) / 50]
16/09/15 15:45:28 WARN TaskSetManager: Stage 12 contains a task of very large size (5402 KB). The maximum recommended task size is 100 KB.
[Stage 12:======================================================> (49 + 1) / 50]
100000000
>>> rdd.countApprox(1, 0.01)
16/09/15 15:45:45 WARN TaskSetManager: Stage 13 contains a task of very large size (5402 KB). The maximum recommended task size is 100 KB.
[Stage 13:====================================================>   (47 + 3) / 50]
100000000
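
To make the question concrete, here is a minimal plain-Python sketch of one kind of approximation that could be expected when only some of the 50 partitions have been counted: scale the partial count up by the fraction of partitions seen. This only illustrates the idea and is not Spark's implementation; approx_count, sizes and the partition sizes are hypothetical names and values made up for the example.

```python
import random


def approx_count(partition_sizes, finished):
    """Estimate the total count using only the first `finished` partitions.

    Hypothetical helper for illustration; not part of Spark's API.
    """
    total_parts = len(partition_sizes)
    if finished <= 0:
        raise ValueError("need at least one finished partition")
    seen = sum(partition_sizes[:finished])
    # Scale the partial count up by the fraction of partitions seen so far.
    return seen * total_parts / finished


random.seed(0)
# 50 made-up partitions of roughly, but not exactly, 2 * 10**6 elements each.
sizes = [2 * 10**6 + random.randint(-50000, 50000) for _ in range(50)]
exact = sum(sizes)

for finished in (5, 25, 50):
    estimate = approx_count(sizes, finished)
    print(finished, round(estimate), exact)
```

With all 50 partitions counted the estimate equals the exact total, which matches what the runs above return; an estimate that differs from the exact count would only appear if fewer partitions had been counted when the result was produced.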

