spark-user mailing list archives

From madeleine <>
Subject zip in pyspark truncates RDD to number of processors
Date Sat, 21 Jun 2014 16:37:50 GMT
Consider the following simple zip:

n = 6
a = sc.parallelize(range(n))
b = sc.parallelize(range(n)).map(lambda j: j)
c = a.zip(b)
print a.count(), b.count(), c.count()

>> 6 6 4

By varying n, I find that c.count() is always min(n, 4), where 4 happens to
be the number of threads on my machine. By calling c.collect(), I see that
the RDD has simply been truncated to its first 4 entries. Strangely, this
doesn't happen without the map call on b.
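For what it's worth, here is a plain-Python sketch (no Spark needed) of the
per-partition pairing that RDD.zip performs: zip matches the two RDDs
partition-by-partition, and within each partition pair it stops at the shorter
side. If the map call causes b to split the same 6 elements at different
partition boundaries than a, the overhang would be silently dropped. This is
only a guess at the mechanism, not a confirmed diagnosis; the partition layouts
below are made up for illustration:

```python
# Hypothetical model of RDD.zip's per-partition pairing -- not PySpark source.

def simulated_rdd_zip(parts_a, parts_b):
    """Pair two RDDs partition-by-partition, as RDD.zip does.

    Each argument is a list of partitions (lists of elements).
    Within each pair of partitions, Python's zip stops at the
    shorter side, silently dropping any overhanging elements.
    """
    assert len(parts_a) == len(parts_b), "zip needs equal partition counts"
    out = []
    for pa, pb in zip(parts_a, parts_b):
        out.extend(zip(pa, pb))  # truncates to the shorter partition
    return out

# The same 6 elements in 4 partitions, but split at different boundaries:
a_parts = [[0], [1, 2], [3], [4, 5]]
b_parts = [[0, 1], [2], [3, 4], [5]]
pairs = simulated_rdd_zip(a_parts, b_parts)
print(len(pairs))  # 4, not 6: each misaligned overhang is dropped
```

With these (assumed) layouts, each of the 4 partition pairs contributes only
one tuple, giving exactly the count of 4 I see above.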

Any ideas?
