spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xiang Huo <huoxiang5...@gmail.com>
Subject Re: How set how many cpu can be used by spark
Date Tue, 24 Sep 2013 03:50:12 GMT
What I am doing is splitting a large RDD into several small ones and write
each small one into separate file. I use Partitioner and PairRDDFunctions
classes to finish this things. But what is very strange is that every time
I run my program, all CPUs could run with full workload at the beginning.
But at the end, There are only one cpu can work with full workload. I guess
the reason is the part of writing to disk, But I have no idea how to
improve it.

This is the code I use to write each partition into file:
 data_partitions.foreachPartition(r => {
   if(r.nonEmpty){
   val filename = r.take(1).toArray.apply(0)._1
   //println("*************************")
   println("Filename: " + filename)
   val outFile = new java.io.FileWriter(outPath + filename)
   r.map(record => new ParseDNSFast().antiConvert(record._2)).foreach(r =>
outFile.write(r+"\n"))
   outFile.close
   }
   })

Any help is appreciated!.




2013/9/23 Reynold Xin <rxin@cs.berkeley.edu>

> It's probably just because your application is only using 2 threads. Spark
> should be allocating a thread pool large enough, but the RDD's you are
> operating on have only 2 partitions, for example.
>
> To give it a try, do
>
> sc.parallelize(1 to 10, 20).mapPartitions { iter =>
> Thread.sleep(10000000); iter }.count
>
> And see if you have 20 tasks being launched.
>
>
>
> --
> Reynold Xin, AMPLab, UC Berkeley
> http://rxin.org
>
>
>
> On Sun, Sep 22, 2013 at 9:48 PM, Xiang Huo <huoxiang5659@gmail.com> wrote:
>
>> Hi all,
>>
>> I am trying to run a spark program on a server. It is not a cluster but
>> only a server. I want to configure my spark program can use at most 20 CPU,
>> because this machine is also shared by other users.
>>
>> I know I can set local[K] as the value of Master URLs to limited how many
>> worker threads in this program. But after I run my program, there is only
>> at least two CPUs used. And the program will be run a long time if there is
>> only one or two cpus used.
>>
>> Does any one have met similar situation or have any suggestion?
>>
>> Thanks.
>>
>> Xiang
>> --
>> Xiang Huo
>> Department of Computer Science
>> University of Illinois at Chicago(UIC)
>> Chicago, Illinois
>> US
>> Email: huoxiang5659@gmail.com
>>            or xhuo4@uic.edu
>>
>
>


-- 
Xiang Huo
Department of Computer Science
University of Illinois at Chicago(UIC)
Chicago, Illinois
US
Email: huoxiang5659@gmail.com
           or xhuo4@uic.edu

Mime
View raw message