spark-user mailing list archives

From Igor Berman <igor.ber...@gmail.com>
Subject Re: Parallel execution of RDDs
Date Mon, 31 Aug 2015 14:07:27 GMT
What is the size of the pool you're submitting Spark jobs from (the futures
you mentioned)? Is it 8? I think you have a fixed thread pool of 8, so there
can't be more than 8 parallel jobs running... so try increasing it.
What is the number of partitions of each of your RDDs?
How many cores does each of your worker machines have (those 15 you mentioned)?
E.g. if you have 15 * 8 cores but an RDD with 1000 partitions, there is no
way you'll get parallel job execution, since a single job already fills all
the cores with tasks (unless you manage resources per submit/job).
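As a language-neutral sketch of the first point (plain Python with the standard library, no Spark involved): a fixed thread pool caps how many submissions can be in flight at once, which matches the ceiling of 8 active jobs described above. `peak_concurrency` is a hypothetical helper for illustration, not a Spark API.

```python
from concurrent.futures import ThreadPoolExecutor, wait
import threading
import time

def peak_concurrency(pool_size, jobs):
    """Run dummy jobs on a fixed pool; return the peak number running at once."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def job():
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.05)  # stand-in for a blocking job submission
        with lock:
            running -= 1

    with ThreadPoolExecutor(max_workers=pool_size) as ex:
        wait([ex.submit(job) for _ in range(jobs)])
    return peak

# However many futures are queued, a pool of 8 never runs more than 8 at once.
print(peak_concurrency(pool_size=8, jobs=100))
```

The same effect applies when each task submits a Spark job and blocks on its result: the pool size, not the cluster, becomes the ceiling on concurrent jobs.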



On 31 August 2015 at 16:51, Brian Parker <astone065@gmail.com> wrote:

> Hi, I have a large number of RDDs that I need to process separately.
> Instead of submitting these jobs to the Spark scheduler one by one, I'd
> like to submit them in parallel in order to maximize cluster utilization.
>
> I've tried to process the RDDs as Futures, but the number of Active jobs
> maxes out at 8 and the run time is no faster than serial processing (even
> with a 15-node cluster).  What is the limitation on number of Active jobs
> in the Spark scheduler?
>
> What are some strategies to maximize cluster utilization with many
> (possibly small) RDDs?  Is this a good use case for Spark Streaming?
>
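To spell out the capacity arithmetic behind the reply above (15 workers from the question; 8 cores per worker and a 1000-partition RDD are Igor's illustrative figures):

```python
import math

workers = 15            # nodes mentioned in the question
cores_per_worker = 8    # illustrative figure from the reply
total_cores = workers * cores_per_worker       # task slots cluster-wide

partitions = 1000                              # tasks generated by one job
waves = math.ceil(partitions / total_cores)    # scheduling waves for that job

print(total_cores)  # 120
print(waves)        # 9
```

With 1000 tasks against 120 slots, a single job keeps every core busy for about 9 task waves, so a second job has no idle cores to run on regardless of how many jobs are submitted in parallel.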
