spark-dev mailing list archives

From Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
Subject Re: Spark performance on 32 Cpus Server Cluster
Date Fri, 20 Feb 2015 12:18:32 GMT
Hi Sean,
I'm trying to increase the CPU usage by running logistic regression on
different datasets in parallel. They shouldn't depend on each other.
I train several logistic regression models on different column
combinations of a main dataset. I processed the combinations in a ParArray
in an attempt to increase CPU usage, but it did not help.
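
A minimal sketch of the pattern described above, written in plain Python rather than Spark/Scala so that it runs on its own. The names ROWS, train_lr and train_all are hypothetical, the toy dataset is invented for illustration, and ThreadPoolExecutor stands in for the Scala ParArray. train_lr is a small gradient-descent placeholder marking where a real Spark training job would be submitted; it is not the BinaryLogisticRegressionBFGS code discussed in this thread.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
import math

# Toy stand-in for the cached main dataset: (features, label) rows.
# Invented data, for illustration only.
ROWS = [
    ([0.0, 1.0, 0.5], 0),
    ([1.0, 0.0, 0.2], 1),
    ([0.9, 0.2, 0.1], 1),
    ([0.1, 0.8, 0.9], 0),
]


def train_lr(columns, rows=ROWS, steps=200, lr=0.5):
    """Hypothetical placeholder for one training job.

    Fits a logistic regression by batch gradient descent using only
    the given feature columns. A real driver would submit a Spark job here.
    """
    w = [0.0] * len(columns)
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(columns)
        grad_b = 0.0
        for x, y in rows:
            z = b + sum(wi * x[c] for wi, c in zip(w, columns))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # prediction minus label
            for i, c in enumerate(columns):
                grad_w[i] += err * x[c]
            grad_b += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return columns, w, b


def train_all(n_columns=3, k=2, max_workers=4):
    """Train one model per k-column combination, dispatched from threads."""
    combos = list(combinations(range(n_columns), k))
    # The combinations do not depend on each other, so each one can be
    # submitted from its own thread. map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(train_lr, combos))


if __name__ == "__main__":
    for cols, weights, bias in train_all():
        print(cols, [round(v, 3) for v in weights], round(bias, 3))
```

In Spark itself, jobs submitted from separate driver threads can run concurrently within one SparkContext. A driver-side fan-out like this would presumably raise CPU usage only when each individual job leaves cores idle.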



2015-02-20 8:17 GMT-02:00 Sean Owen <sowen@cloudera.com>:

> It sounds like your computation just isn't CPU bound, right? Or maybe
> that only some stages are. It's not clear what work you are doing
> beyond the core LR.
>
> Stages don't wait on each other unless one depends on the other. You'd
> have to clarify what you mean by running stages in parallel, like what
> are the interdependencies.
>
> On Fri, Feb 20, 2015 at 10:01 AM, Dirceu Semighini Filho
> <dirceu.semighini@gmail.com> wrote:
> > Hi all,
> > I'm running Spark 1.2.0 in standalone mode on different cluster and
> > server sizes. All of my data is cached in memory.
> > Basically I have a mass of data, about 8 GB, with about 37k columns, and
> > I'm running different configs of a BinaryLogisticRegressionBFGS.
> > When I ran Spark on 9 servers (1 master and 8 slaves) with 32 cores
> > each, I noticed that the CPU usage varied from 20% to 50% (counting
> > the CPU usage of all 9 servers in the cluster).
> > First I tried to repartition the RDDs to the total number of client
> > cores (256), but that didn't help. Then I tried setting the property
> > *spark.default.parallelism* to the same number (256), but that didn't
> > increase the CPU usage either.
> > Looking at the Spark monitoring tool, I saw that some stages took 52s to
> > complete.
> > My last shot was trying to run some tasks in parallel, but when I
> > ran tasks in parallel (4 tasks), the total CPU time spent to complete
> > them increased by about 10%, so task parallelism didn't help.
> > Looking at the monitoring tool, I noticed that when running tasks in
> > parallel, the stages complete together: if I have 4 stages running in
> > parallel (A, B, C and D) and A, B and C finish, they wait for D before
> > all 4 stages are marked as completed. Is that right?
> > Is there any way to improve the CPU usage when running on large servers?
> > Is spending more time when running tasks an expected behaviour?
> >
> > Kind Regards,
> > Dirceu
>
