"Workers" has a specific meaning in Spark. You are running many on one
machine? That's possible but not usual.
Each worker's executors then have access to only a fraction of your
machine's resources. If you're not increasing parallelism, maybe you're
not actually using the additional workers, so you are using fewer
resources for your job.
Or because the resulting executors are smaller, maybe you're hitting
GC thrashing in these executors with smaller heaps.
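As a rough illustration of the parallelism point (a sketch only:
busy_slots and waves are hypothetical helpers, not Spark APIs, and it
assumes every task takes about the same time), a stage with a fixed
number of partitions can keep at most that many slots busy, so any
slots beyond that sit idle:

```python
import math


def busy_slots(num_tasks: int, total_slots: int) -> int:
    """Slots that can actually run a task at the same time."""
    return min(num_tasks, total_slots)


def waves(num_tasks: int, total_slots: int) -> int:
    """Rounds of scheduling needed if each slot runs one task at a time."""
    return math.ceil(num_tasks / total_slots)


# Hypothetical job with 2 partitions (tasks); the slot counts are illustrative.
for slots in (1, 2, 4, 16):
    print(slots, busy_slots(2, slots), waves(2, slots))
```

Once the slot count reaches the task count, the number of waves stays
at 1, so extra workers add overhead without adding useful parallelism.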
Or if you're not actually configuring the executors to use less
memory, maybe you're over-committing your RAM and swapping?
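To make the resource arithmetic concrete, here is a minimal sketch.
The function names and the 16-core / 32 GB machine are assumptions for
illustration, not anything from your setup. In standalone mode the
per-worker figures would correspond to what you set via
SPARK_WORKER_CORES and SPARK_WORKER_MEMORY in conf/spark-env.sh.

```python
def per_worker_share(total_cores: int, total_mem_gb: float, workers: int):
    """Evenly split one machine's cores and memory across worker instances."""
    return total_cores // workers, total_mem_gb / workers


def overcommitted(total_mem_gb: float, workers: int, mem_per_worker_gb: float) -> bool:
    """True if the configured worker memory adds up to more than physical RAM."""
    return workers * mem_per_worker_gb > total_mem_gb


# Hypothetical machine: 16 cores, 32 GB RAM, split across 16 workers.
cores_each, mem_each = per_worker_share(16, 32.0, 16)
print(cores_each, mem_each)

# Leaving each worker at 8 GB (an assumed setting) asks for far more
# memory than the machine has, which is where swapping would come from.
print(overcommitted(32.0, 16, 8.0))
print(overcommitted(32.0, 16, mem_each))
```

The first case is the smaller-heaps scenario, the second is the
over-commit scenario; which one applies depends on how the workers
are actually configured.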
Bottom line, you wouldn't use multiple workers on one small standalone
node. This isn't a good way to estimate performance on a distributed
cluster.
On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan <firstname.lastname@example.org> wrote:
> No, I just have a single node standalone cluster.
> I am not tweaking around with the code to increase parallelism. I am just
> running SparkKMeans that is there in Spark-1.0.0
> I just wanted to know if this behavior is natural. And if so, what causes
> it?
> Thank you
> On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen <email@example.com> wrote:
>> What's your storage like? Are you adding worker machines that are
>> remote from where the data lives? I wonder if it just means you are
>> spending more and more time sending the data over the network as you
>> try to ship more of it to more remote workers.
>> To answer your question: no, in general more workers means more
>> parallelism and therefore faster execution. But that depends on a lot
>> of things. For example, if your process isn't parallelized to use all
>> available execution slots, adding more slots doesn't do anything.
>> On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan <firstname.lastname@example.org>
>> wrote:
>> > Yes, I am talking about standalone single node cluster.
>> > No, I am not increasing parallelism. I just wanted to know if it is
>> > natural.
>> > Does message passing across the workers account for this?
>> > I am running SparkKMeans, just to validate one prediction model. I am
>> > using several data sets. I am in standalone mode. I am varying the
>> > workers from 1 to 16.
>> > On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen <email@example.com> wrote:
>> >> I can imagine a few reasons. Adding workers might cause fewer tasks to
>> >> execute locally (?) So you may be executing more tasks remotely.
>> >> Are you increasing parallelism? For trivial jobs, chopping them up
>> >> further may cause you to pay more overhead for managing so many small
>> >> tasks, for no speedup in execution time.
>> >> Can you provide any more specifics though? You haven't said what
>> >> you're running, what mode, how many workers, how long it takes, etc.
>> >> On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan
>> >> <firstname.lastname@example.org> wrote:
>> >> > Hi,
>> >> > I have been running some jobs in my local single node stand alone
>> >> > cluster. I am varying the worker instances for the same job, and the
>> >> > time taken for the job to complete increases with increase in the
>> >> > number of workers. I repeated some experiments varying the number of
>> >> > nodes in a cluster too and the same behavior is seen.
>> >> > Can the idea of worker instances be extrapolated to the nodes in a
>> >> > cluster?
>> >> >
>> >> > Thank You