spark-user mailing list archives

From Phillip Henry <londonjava...@gmail.com>
Subject Re: Data growth vs Cluster Size planning
Date Tue, 12 Feb 2019 11:26:50 GMT
There is too little information to give an answer, if indeed an a priori
answer is possible.

However, I would do the following on your test instances:

- Run jstat -gc on all your nodes. It might be that the GC is taking a lot
of time.

- Poll with jstack semi-frequently. It can give you a fairly good idea,
in a non-invasive manner, of where in the code the time is being spent
(a minimal sketch combining both checks follows this list).
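
For example, here is a minimal polling sketch in Python. The worker
hostnames, the passwordless SSH access, and the jps-based PID lookup are
all assumptions about your setup; treat it as a starting point rather
than a finished tool:

    #!/usr/bin/env python3
    # Poll jstat -gc and jstack on each worker's Spark executors via SSH.
    # Assumes passwordless SSH and that jps/jstat/jstack are on the
    # remote PATH (they ship with the JDK).
    import subprocess
    import time

    HOSTS = ["worker1", "worker2"]  # hypothetical worker hostnames

    def remote(host, cmd):
        """Run a command on a remote host and return its stdout."""
        return subprocess.run(["ssh", host, cmd],
                              capture_output=True, text=True).stdout

    for _ in range(60):  # one sample per minute, for an hour
        for host in HOSTS:
            # Spark executors run as CoarseGrainedExecutorBackend JVMs.
            pids = remote(host, "jps | awk "
                          "'/CoarseGrainedExecutorBackend/ {print $1}'"
                          ).split()
            for pid in pids:
                print(remote(host, "jstat -gc " + pid))  # GC counters
                print(remote(host, "jstack " + pid))     # thread dump
        time.sleep(60)

If the GCT column reported by jstat grows quickly relative to wall-clock
time, the executors are starved of memory. And if successive jstack
dumps keep landing in the same stack frames, that is where the time is
going (a poor man's profiler).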

Phillip



On Mon, Feb 11, 2019 at 9:48 AM Aakash Basu <aakash.spark.raj@gmail.com>
wrote:

> Hi,
>
> I ran a dataset of *200 columns and 0.2M records* on a cluster of *1
> master (18 GB), 2 slaves (32 GB each), 16 cores/slave*; it took around
> *772 minutes* for a *very large ML tuning job* (training).
>
> Now, my requirement is to run the *same operation on 3M records*. Any
> idea how we should proceed? Should we go for vertical scaling or
> horizontal scaling? How should this problem be approached in a
> stepwise/systematic manner?
>
> Thanks in advance.
>
> Regards,
> Aakash.
>
