spark-user mailing list archives

From Phillip Henry <>
Subject Re: Data growth vs Cluster Size planning
Date Tue, 12 Feb 2019 11:26:50 GMT
Too little information to give an answer, if indeed an a priori answer is possible at all.

However, I would do the following on your test instances:

- Run jstat -gc on all your nodes. It might be that the GC is taking a lot
of time.

- Poll with jstack semi-frequently. It can give you a fairly good idea of where
in the code the time is being spent, in a non-invasive manner.
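Not from the thread, but as an illustration of how the first suggestion can be interpreted: the last column of a `jstat -gc` data row (GCT, on JDK 8) is cumulative GC time in seconds, so a rough GC-overhead fraction can be computed with a small helper like this (the sample row and uptime below are made-up values):

```python
def gc_overhead(jstat_row: str, uptime_seconds: float) -> float:
    """Fraction of JVM uptime spent in garbage collection.

    jstat_row is one data line from `jstat -gc <pid>`; its last
    column (GCT) is cumulative GC time in seconds.
    """
    gct = float(jstat_row.split()[-1])
    return gct / uptime_seconds

# Illustrative sample row; the final value (16.800) is GCT in seconds.
row = ("1024.0 1024.0 0.0 512.3 8192.0 4096.5 20480.0 15360.2 "
       "4480.0 4352.1 512.0 498.7 120 4.500 8 12.300 16.800")

# 16.8 s of GC over a 600 s uptime -> ~2.8% overhead; as a rule of
# thumb, anything in the tens of percent points at a GC problem.
print(round(gc_overhead(row, uptime_seconds=600.0), 3))
```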


On Mon, Feb 11, 2019 at 9:48 AM Aakash Basu <> wrote:

> Hi,
> I ran a dataset of *200 columns and 0.2M records* in a cluster of *1
> master with 18 GB, 2 slaves with 32 GB each, 16 cores/slave*; it took around
> *772 minutes* for a *very large ML-tuning-based job* (training).
> Now, my requirement is to run the *same operation on 3M records*. Any
> idea on how we should proceed? Should we go for vertical scaling or
> horizontal? How should this problem be approached in a
> stepwise/systematic manner?
> Thanks in advance.
> Regards,
> Aakash.
