spark-user mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: How many partitions is my RDD split into?
Date Mon, 24 Mar 2014 04:53:44 GMT
It's much simpler: rdd.partitions.size
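For the PySpark equivalent, a minimal sketch, assuming a live SparkContext `sc` and a release that exposes RDD.getNumPartitions():

    # Build an RDD with an explicit partition count, then read it back.
    rdd = sc.parallelize([1, 2, 3, 4], 4)
    rdd.getNumPartitions()  # returns 4

In Scala, rdd.partitions is just the array of Partition objects backing the RDD, so its size is the partition count.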


On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:

> Hey there fellow Dukes of Data,
>
> How can I tell how many partitions my RDD is split into?
>
> I'm interested in knowing because, from what I gather, choosing a sensible
> number of partitions matters for performance. If I'm looking to understand
> how my pipeline is performing, say for a parallelized write out to HDFS,
> knowing how many partitions an RDD has would be a good thing to check.
>
> Is that correct?
>
> I could not find an obvious method or property to see how my RDD is
> partitioned. Instead, I devised the following thingy:
>
> # Yield a single element (the partition index) per partition.
> def f(idx, itr): yield idx
>
> rdd = sc.parallelize([1, 2, 3, 4], 4)
> # One element comes back per partition, so count() equals the partition count.
> rdd.mapPartitionsWithIndex(f).count()
>
> Frankly, I'm not sure what I'm doing here, but this seems to give me the
> answer I'm looking for. Derp. :)
>
> So in summary, should I care about how finely my RDDs are partitioned? And
> how would I check on that?
>
> Nick
