How many partitions are in your input data set? One possibility is that your input data consists of 10 unsplittable files, so you end up with 10 partitions. You could increase the parallelism by using RDD#repartition().
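A minimal sketch of checking and raising the partition count — this assumes an existing SparkContext `sc` and an illustrative input path, and needs a running cluster to execute:

```scala
// Sketch: assumes an existing SparkContext `sc`; the input path is illustrative.
val rdd = sc.textFile("hdfs:///data/input")

// Each unsplittable input file becomes one partition.
println("partitions: " + rdd.partitions.size)

// Redistribute the data across 40 partitions (this incurs a shuffle),
// so up to 40 tasks can run in parallel in subsequent stages.
val repartitioned = rdd.repartition(40)
println("partitions after repartition: " + repartitioned.partitions.size)
```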

Note that mapPartitionsWithIndex is effectively the "main processing loop" for many Spark operations: it iterates through all the elements of a partition and performs some computation (most likely running your user code) on each one.
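For reference, mapPartitionsWithIndex hands your function an iterator over each partition together with that partition's index. A sketch, again assuming an existing SparkContext `sc` (not runnable without a cluster):

```scala
// Sketch: assumes an existing SparkContext `sc`.
val rdd = sc.parallelize(1 to 100, 10) // 10 partitions => 10 tasks

// The function runs once per partition; `idx` is the partition index
// and `iter` iterates over that partition's elements.
val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(x => (idx, x * 2)) // the per-element computation
}
tagged.take(5).foreach(println)
```

Since one task is created per partition, the time spent in this stage scales with how much work each partition holds — which is why the partition count matters for query speed.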

You can see the number of partitions in your RDD by visiting the Spark driver web interface. To access this, visit port 8080 on the host running your standalone master (assuming you're running in standalone mode), which will have a link to the application web interface. The Tachyon master also has a useful web interface, available on port 19999.


On Sun, May 25, 2014 at 5:43 PM, qingyang li <liqingyang1985@gmail.com> wrote:
hi, Mayur, thanks for replying.
I know a Spark application should take all cores by default. My question is: how do I set the number of tasks on each core?
If one slice maps to one task, how can I set the slice file size?


2014-05-23 16:37 GMT+08:00 Mayur Rustagi <mayur.rustagi@gmail.com>:

How many cores do you see on your Spark master (port 8080)?
By default a Spark application should take all cores when you launch it, unless you have set a max-cores configuration.




On Thu, May 22, 2014 at 4:07 PM, qingyang li <liqingyang1985@gmail.com> wrote:
my aim in setting the task number is to increase the query speed, and I have also found that "mapPartitionsWithIndex at Operator.scala:333" is costing much time. So my other question is:
how to tune mapPartitionsWithIndex to bring that cost down?




2014-05-22 18:09 GMT+08:00 qingyang li <liqingyang1985@gmail.com>:

I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 " in shark-env.sh,
but I find there are only 10 tasks on the cluster, 2 tasks on each machine.
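For completeness, the setting would be passed like this in shark-env.sh (a config fragment; note that spark.default.parallelism only sets the default partition count for shuffle operations and parallelize — it does not re-split unsplittable input files, which may be why only 10 tasks appear):

```shell
# shark-env.sh (fragment): request a default parallelism of 40.
# This affects operations like reduceByKey and parallelize; the
# partition count of an RDD read from unsplittable files is still
# determined by the number of input files.
export SPARK_JAVA_OPTS+=" -Dspark.default.parallelism=40 "
```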


2014-05-22 18:07 GMT+08:00 qingyang li <liqingyang1985@gmail.com>:

I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 " in shark-env.sh


2014-05-22 17:50 GMT+08:00 qingyang li <liqingyang1985@gmail.com>:

I am using Tachyon as the storage system and Shark to query a table which is a big table. I have 5 machines in a Spark cluster, with 4 cores on each machine.
My questions are:
1. how do I set the number of tasks on each core?
2. where can I see how many partitions an RDD has?