spark-user mailing list archives

From Prem Sure <sparksure...@gmail.com>
Subject Re: Inferring Data driven Spark parameters
Date Wed, 04 Jul 2018 14:17:14 GMT
Can you share which API your jobs use: just core RDDs, SQL, DStreams,
etc.?
Refer to the recommendations at
https://spark.apache.org/docs/2.3.0/configuration.html for detailed
configurations.
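
As a rough sketch of what a data-driven derivation could look like (not an
established formula), the Python below picks spark.sql.shuffle.partitions
from the input size and spark.driver.maxResultSize from the feature count,
then renders them as spark-submit arguments. The function names, the
128 MiB-per-partition target, and the threshold handling are assumptions for
illustration; only the "100+ features, double maxResultSize to 2G" example
comes from this thread.

```python
import math


def derive_data_confs(num_features, input_bytes,
                      target_partition_bytes=128 * 1024**2,
                      feature_threshold=100):
    """Return data-dependent Spark confs (illustrative heuristics only)."""
    # Assumed rule of thumb: aim for ~target_partition_bytes per shuffle partition.
    partitions = max(1, math.ceil(input_bytes / target_partition_bytes))
    # Example from this thread: double the 1g default for 100+ features.
    max_result = "2g" if num_features >= feature_threshold else "1g"
    return {
        "spark.sql.shuffle.partitions": str(partitions),
        "spark.driver.maxResultSize": max_result,
    }


def to_submit_args(confs):
    """Flatten a conf dict into ['--conf', 'key=value', ...] for spark-submit."""
    args = []
    for key, value in sorted(confs.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical dataset: 120 features, 10 GiB of input.
confs = derive_data_confs(num_features=120, input_bytes=10 * 1024**3)
print(" ".join(to_submit_args(confs)))
```

The values have to be computed before submission, because they are passed
on the spark-submit command line along with the cluster-specific options.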
Thanks,
Prem

On Wed, Jul 4, 2018 at 12:34 PM, Aakash Basu <aakash.spark.raj@gmail.com>
wrote:

> I do not want to change executor/driver cores/memory on the fly in a
> single Spark job; all I want is to make them cluster specific. So, I want
> to have a formula with which, depending on the size of the driver and
> executors, I can work out the values for them before passing
> them to spark-submit.
>
> I more or less know how to achieve the above, as I've done it previously.
>
> What I need now is to tweak the other Spark confs depending on
> the data. Is that possible? I mean (just an example), if I have 100+
> features, I want to double my default spark.driver.maxResultSize to 2G, and
> similarly for other configs. Can that be achieved by any means for an
> optimal run on that kind of dataset? If yes, how?
>
> On Tue, Jul 3, 2018 at 6:28 PM, Vadim Semenov <vadim@datadoghq.com> wrote:
>
>> You can't change the executor/driver cores/memory on the fly once
>> you've already started a Spark Context.
>> On Tue, Jul 3, 2018 at 4:30 AM Aakash Basu <aakash.spark.raj@gmail.com>
>> wrote:
>> >
>> > We aren't using Oozie or similar. Moreover, the end-to-end job will be
>> > exactly the same, but the data will be extremely different (number of
>> > continuous and categorical columns, vertical size, horizontal size, etc.).
>> > Hence, if there were a calculation by which we could simply take the data
>> > and derive the respective configuration/parameters, that would be great.
>> >
>> > On Tue, Jul 3, 2018 at 1:09 PM, Jörn Franke <jornfranke@gmail.com>
>> wrote:
>> >>
>> >> Don't do this in your job. Create different jobs for different types
>> >> of jobs and orchestrate them using Oozie or similar.
>> >>
>> >> On 3. Jul 2018, at 09:34, Aakash Basu <aakash.spark.raj@gmail.com>
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> Cluster - 5 node (1 Driver and 4 workers)
>> >> Driver Config: 16 cores, 32 GB RAM
>> >> Worker Config: 8 cores, 16 GB RAM
>> >>
>> >> I'm using the parameters below, of which I know the first chunk is
>> >> cluster dependent and the second chunk is data/code dependent.
>> >>
>> >> --num-executors 4
>> >> --executor-cores 5
>> >> --executor-memory 10G
>> >> --driver-cores 5
>> >> --driver-memory 25G
>> >>
>> >>
>> >> --conf spark.sql.shuffle.partitions=100
>> >> --conf spark.driver.maxResultSize=2G
>> >> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
>> >> --conf spark.scheduler.listenerbus.eventqueue.capacity=20000
>> >>
>> >> I arrived at these values through my R&D on the properties and
>> >> the issues I faced, hence the handles.
>> >>
>> >> My ask here is -
>> >>
>> >> 1) How can I calculate, using some formula or code, the lower chunk,
>> >> which depends on the data/code?
>> >> 2) What other properties/configurations can I use to shorten my job
>> >> runtime?
>> >>
>> >> Thanks,
>> >> Aakash.
>> >
>> >
>>
>>
>> --
>> Sent from my iPhone
>>
>
>
