spark-dev mailing list archives

From Josh Rosen <rosenvi...@gmail.com>
Subject Re: configuring final partition length
Date Tue, 19 Nov 2013 22:29:55 GMT
I think that the reduce() action is implemented as
mapPartitions.collect().reduce(), so the number of result tasks is
determined by the degree of parallelism of the RDD being reduced.
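For example, a rough sketch of what I mean (assuming an existing SparkContext `sc`; the data and partition count here are made up):

    # reduce() runs one task per partition of the RDD, so setting the number
    # of partitions when the RDD is created controls the reduce parallelism.
    rdd = sc.parallelize(range(1000000), 32)   # 32 partitions -> 32 tasks
    total = rdd.reduce(lambda a, b: a + b)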

Some operations, like reduceByKey(), accept a `numPartitions` argument for
configuring the number of reducers:
https://spark.incubator.apache.org/docs/0.8.0/api/pyspark/index.html
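For example, a quick sketch of passing `numPartitions` to reduceByKey() (again assuming an existing SparkContext `sc`; the data is made up):

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    # The second argument sets how many partitions (and hence reduce tasks)
    # the shuffled output will have.
    counts = pairs.reduceByKey(lambda a, b: a + b, 16)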


On Sun, Nov 17, 2013 at 11:24 PM, Umar Javed <umarj.javed@gmail.com> wrote:

> I'm using pyspark. I was wondering how to modify the number of partitions
> for the result task (reduce in my case). I'm running Spark on a cluster of
> two machines (each with 16 cores). Here's the relevant log output for my
> result stage:
>
> 13/11/17 23:16:47 INFO SparkContext: time: 18851958895218046
> *13/11/17 23:16:47 INFO SparkContext: partition length: 2*
> 13/11/17 23:16:47 DEBUG DAGScheduler: Got event of type
> org.apache.spark.scheduler.JobSubmitted
> 13/11/17 23:16:47 INFO DAGScheduler: class of RDD: class
> org.apache.spark.api.python.PythonRDD PythonRDD[6] at RDD at
> PythonRDD.scala:34
> 13/11/17 23:16:47 INFO DAGScheduler: number of dependencies: 1
> 13/11/17 23:16:47 INFO DAGScheduler: class of dep: class
> org.apache.spark.rdd.MappedRDD MappedRDD[5] at values at
> NativeMethodAccessorImpl.java:-2
> 13/11/17 23:16:47 INFO DAGScheduler: class of RDD: class
> org.apache.spark.rdd.MappedRDD MappedRDD[5] at values at
> NativeMethodAccessorImpl.java:-2
> 13/11/17 23:16:47 INFO DAGScheduler: number of dependencies: 1
> 13/11/17 23:16:47 INFO DAGScheduler: class of dep: class
> org.apache.spark.rdd.ShuffledRDD ShuffledRDD[4] at partitionBy at
> NativeMethodAccessorImpl.java:-2
> 13/11/17 23:16:47 INFO DAGScheduler: class of RDD: class
> org.apache.spark.rdd.ShuffledRDD ShuffledRDD[4] at partitionBy at
> NativeMethodAccessorImpl.java:-2
> 13/11/17 23:16:47 INFO DAGScheduler: number of dependencies: 1
>
>
> In this case, Spark seems to automatically configure the number of
> partitions for the result tasks to be 2. The result is that only two reduce
> tasks run (one on each machine). Is there a way to modify this number? More
> generally, how do you configure the number of reduce tasks?
>
> thanks!
> Umar
>
