spark-user mailing list archives

From "Wang, Ningjun (LNG-NPV)" <ningjun.w...@lexisnexis.com>
Subject RE: How to force parallel processing of RDD using multiple thread
Date Fri, 16 Jan 2015 14:14:47 GMT
Does parallel processing mean it is executed in multiple workers, or in one worker but with
multiple threads? For example, if I have only one worker but my RDD has 4 partitions, will it
be executed in parallel on 4 threads?
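
A sketch of what I mean (one could run this from spark-shell; the input path and partition
count are just placeholders):

    val rdd = sc.textFile("/tmp/input.txt", 4)   // ask for at least 4 partitions
    println(rdd.partitions.length)               // confirm the actual partition count
    rdd.mapPartitions { it =>
      Iterator(Thread.currentThread().getName)   // record the thread that ran each partition
    }.collect().foreach(println)                 // distinct names suggest tasks ran on separate threads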

The reason I am asking is to decide whether I need to configure Spark to have multiple
workers. By default, it just starts with one worker.
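
Would something like this in conf/spark-env.sh be the right way to get multiple workers?
(Just a sketch; the values are guesses for a 4-core box.)

    export SPARK_WORKER_INSTANCES=2   # run two worker daemons on this node
    export SPARK_WORKER_CORES=2       # give each worker 2 of the 4 cores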

Regards,

Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541


-----Original Message-----
From: Sean Owen [mailto:sowen@cloudera.com] 
Sent: Thursday, January 15, 2015 11:04 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to force parallel processing of RDD using multiple thread

Check the number of partitions in your input. It may be much less than the available parallelism
of your small cluster. For example, input that lives in just 1 partition will spawn just 1
task.
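
For example (a minimal sketch; the path is a placeholder):

    val rdd = sc.textFile("hdfs:///data/input")   // a small file may come back as 1 partition
    println(rdd.partitions.length)                // check before assuming parallelism
    val wide = rdd.repartition(4)                 // force 4 partitions => up to 4 concurrent tasks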

Beyond that, parallelism just happens. You can see the parallelism of each operation in the
Spark UI.

On Thu, Jan 15, 2015 at 10:53 PM, Wang, Ningjun (LNG-NPV) <ningjun.wang@lexisnexis.com> wrote:
> Spark Standalone cluster.
>
> My program is running very slowly; I suspect it is not doing parallel processing of the RDD.
> How can I force it to run in parallel? Is there any way to check whether it is processed in parallel?
>
> Regards,
>
> Ningjun Wang
> Consulting Software Engineer
> LexisNexis
> 121 Chanlon Road
> New Providence, NJ 07974-1541
>
>
> -----Original Message-----
> From: Sean Owen [mailto:sowen@cloudera.com]
> Sent: Thursday, January 15, 2015 4:29 PM
> To: Wang, Ningjun (LNG-NPV)
> Cc: user@spark.apache.org
> Subject: Re: How to force parallel processing of RDD using multiple thread
>
> What is your cluster manager? For example, on YARN you would specify --executor-cores. Read:
> http://spark.apache.org/docs/latest/running-on-yarn.html
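>
> For instance (just a sketch; the executor counts and jar name are illustrative):
>
>   spark-submit --master yarn-cluster --executor-cores 4 --num-executors 2 myapp.jar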
>
> On Thu, Jan 15, 2015 at 8:54 PM, Wang, Ningjun (LNG-NPV) <ningjun.wang@lexisnexis.com> wrote:
>> I have a standalone Spark cluster with only one node with 4 CPU cores.
>> How can I force Spark to do parallel processing of my RDD using
>> multiple threads? For example, I can do the following:
>>
>> spark-submit --master local[4]
>>
>> However, I really want to use the cluster as follows:
>>
>> spark-submit --master spark://10.125.21.15:7070
>>
>> In that case, how can I make sure the RDD is processed with multiple 
>> threads/cores?
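>>
>> For instance, would adding a flag like this help (a guess on my part; the
>> --total-executor-cores value and jar name are just placeholders)?
>>
>>   spark-submit --master spark://10.125.21.15:7070 --total-executor-cores 4 myapp.jar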
>>
>> Thanks
>>
>> Ningjun