spark-user mailing list archives

From Gerard Maas <gerard.m...@gmail.com>
Subject Re: How to force parallel processing of RDD using multiple thread
Date Fri, 16 Jan 2015 14:44:18 GMT
Spark will use the number of cores available in the cluster. If your
cluster is 1 node with 4 cores, Spark will execute up to 4 tasks in
parallel.
Setting the number of partitions to 4 gives each core one task, which keeps
the load even as long as the partitions are of similar size.
Note that this is different from saying "threads": internally, Spark uses
many threads (data block sender/receiver, listeners, notifications,
scheduler, ...).
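
As a rough illustration of the relationship between partitions and cores, here is a small back-of-the-envelope model in plain Python. It does not call any Spark API. The function name `schedule` and its fields are made up for this sketch, and it assumes one task per partition, one core per task, and tasks of equal duration, which real jobs only approximate.

```python
import math


def schedule(num_partitions: int, total_cores: int) -> dict:
    """Simplified model (an assumption, not Spark internals):
    one task per partition, one core per task, equal task durations."""
    if num_partitions < 1 or total_cores < 1:
        raise ValueError("need at least one partition and one core")
    # There is one task per partition, and at most one task runs per core.
    max_parallel = min(num_partitions, total_cores)
    # Number of rounds ("waves") needed to run every task.
    waves = math.ceil(num_partitions / total_cores)
    # Share of available core slots that actually do work across all waves.
    utilization = num_partitions / (waves * total_cores)
    return {"max_parallel": max_parallel, "waves": waves, "utilization": utilization}


if __name__ == "__main__":
    # Compare a few partition counts on a hypothetical 4-core node.
    for partitions in (1, 4, 5, 8):
        print(partitions, schedule(partitions, total_cores=4))
```

Under these assumptions, 1 partition keeps only one of the 4 cores busy, 4 or 8 partitions keep all of them busy, and 5 partitions need a second wave in which three cores sit idle.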

-kr, Gerard.

On Fri, Jan 16, 2015 at 3:14 PM, Wang, Ningjun (LNG-NPV) <
ningjun.wang@lexisnexis.com> wrote:

> Does parallel processing mean it is executed in multiple workers, or
> in one worker but with multiple threads? For example, if I have only one
> worker but my RDD has 4 partitions, will it be executed in parallel in 4 threads?
>
> The reason I am asking is that I am trying to decide whether I need to
> configure Spark to have multiple workers. By default, it just starts with one worker.
>
> Regards,
>
> Ningjun Wang
> Consulting Software Engineer
> LexisNexis
> 121 Chanlon Road
> New Providence, NJ 07974-1541
>
>
> -----Original Message-----
> From: Sean Owen [mailto:sowen@cloudera.com]
> Sent: Thursday, January 15, 2015 11:04 PM
> To: Wang, Ningjun (LNG-NPV)
> Cc: user@spark.apache.org
> Subject: Re: How to force parallel processing of RDD using multiple thread
>
> Check the number of partitions in your input. It may be much less than the
> available parallelism of your small cluster. For example, input that lives
> in just 1 partition will spawn just 1 task.
>
> Beyond that, parallelism just happens. You can see the parallelism of each
> operation in the Spark UI.
>
> On Thu, Jan 15, 2015 at 10:53 PM, Wang, Ningjun (LNG-NPV) <
> ningjun.wang@lexisnexis.com> wrote:
> > Spark Standalone cluster.
> >
> > My program is running very slowly; I suspect it is not processing the RDD
> > in parallel. How can I force it to run in parallel? Is there any way to
> > check whether it is processed in parallel?
> >
> > Regards,
> >
> > Ningjun Wang
> > Consulting Software Engineer
> > LexisNexis
> > 121 Chanlon Road
> > New Providence, NJ 07974-1541
> >
> >
> > -----Original Message-----
> > From: Sean Owen [mailto:sowen@cloudera.com]
> > Sent: Thursday, January 15, 2015 4:29 PM
> > To: Wang, Ningjun (LNG-NPV)
> > Cc: user@spark.apache.org
> > Subject: Re: How to force parallel processing of RDD using multiple
> > thread
> >
> > What is your cluster manager? For example, on YARN you would specify
> > --executor-cores. Read:
> > http://spark.apache.org/docs/latest/running-on-yarn.html
> >
> > On Thu, Jan 15, 2015 at 8:54 PM, Wang, Ningjun (LNG-NPV) <
> > ningjun.wang@lexisnexis.com> wrote:
> >> I have a standalone Spark cluster with only one node with 4 CPU cores.
> >> How can I force Spark to do parallel processing of my RDD using
> >> multiple threads? For example, I can do the following:
> >>
> >> spark-submit --master local[4]
> >>
> >> However, I really want to use the cluster as follows:
> >>
> >> spark-submit --master spark://10.125.21.15:7070
> >>
> >> In that case, how can I make sure the RDD is processed with multiple
> >> threads/cores?
> >>
> >>
> >>
> >> Thanks
> >>
> >> Ningjun
> >>
> >>
>
