spark-dev mailing list archives

From Umar Javed <umarj.ja...@gmail.com>
Subject configuring final partition length
Date Mon, 18 Nov 2013 07:24:41 GMT
I'm using pyspark. I was wondering how to modify the number of partitions
for the result stage (a reduce, in my case). I'm running Spark on a cluster of
two machines (16 cores each). Here's the relevant log output for my
result stage:

13/11/17 23:16:47 INFO SparkContext: time: 18851958895218046
*13/11/17 23:16:47 INFO SparkContext: partition length: 2*
13/11/17 23:16:47 DEBUG DAGScheduler: Got event of type
org.apache.spark.scheduler.JobSubmitted
13/11/17 23:16:47 INFO DAGScheduler: class of RDD: class
org.apache.spark.api.python.PythonRDD PythonRDD[6] at RDD at
PythonRDD.scala:34
13/11/17 23:16:47 INFO DAGScheduler: number of dependencies: 1
13/11/17 23:16:47 INFO DAGScheduler: class of dep: class
org.apache.spark.rdd.MappedRDD MappedRDD[5] at values at
NativeMethodAccessorImpl.java:-2
13/11/17 23:16:47 INFO DAGScheduler: class of RDD: class
org.apache.spark.rdd.MappedRDD MappedRDD[5] at values at
NativeMethodAccessorImpl.java:-2
13/11/17 23:16:47 INFO DAGScheduler: number of dependencies: 1
13/11/17 23:16:47 INFO DAGScheduler: class of dep: class
org.apache.spark.rdd.ShuffledRDD ShuffledRDD[4] at partitionBy at
NativeMethodAccessorImpl.java:-2
13/11/17 23:16:47 INFO DAGScheduler: class of RDD: class
org.apache.spark.rdd.ShuffledRDD ShuffledRDD[4] at partitionBy at
NativeMethodAccessorImpl.java:-2
13/11/17 23:16:47 INFO DAGScheduler: number of dependencies: 1


In this case, Spark seems to automatically set the number of partitions
for the result tasks to 2, so only two reduce tasks run (one on each
machine). Is there a way to modify this number? More generally, how do you
configure the number of reduce tasks?
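
For reference, here is a minimal PySpark sketch of what I would expect to
work -- passing an explicit partition count to the shuffle operation, or
using partitionBy -- though I'm not sure either is the intended knob (I've
also seen the spark.default.parallelism property mentioned). The master URL,
the value 32, and the toy data below are made up for illustration:

from pyspark import SparkContext

# "spark://master:7077" and the app name are placeholders for my setup.
sc = SparkContext("spark://master:7077", "partition-test")

# Toy key/value data standing in for my real input.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# Option 1: pass an explicit partition count to the shuffle step.
# 32 is an arbitrary illustrative value.
counts = pairs.reduceByKey(lambda x, y: x + y, 32)

# Option 2: repartition explicitly with partitionBy.
repartitioned = pairs.partitionBy(32)

# glom() groups each partition's elements into a list, so this prints
# how many partitions the shuffled RDD actually has.
print(len(counts.glom().collect()))

Is something along these lines the recommended way to control the reduce-side
partition count, or is there a configuration setting I'm missing?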

thanks!
Umar
