As the others have mentioned, your loading time might kill your benchmark. I am in a similar process right now, but I time each operation: load, process 1, process 2, etc. This is not always easy with lazy operators, but you can try to force operations with a dummy collect and cache (for benchmarking purposes).
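As a minimal sketch of that per-stage timing (pure Python here; the stages are placeholders, and with Spark you would end each timed stage with an action such as .count() on a cached DataFrame so the lazy plan actually runs):

```python
import time

def timed(label, fn):
    # Run one stage, print its wall-clock time, return its result.
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

# Placeholder stages; in Spark you would force each lazy stage
# with an action, e.g. timed("load", lambda: df.cache().count()).
lines = timed("load", lambda: ["a1", "b2", "a3"])
kept = timed("filter", lambda: [l for l in lines if l.startswith("a")])
count = timed("count", lambda: len(kept))
```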

 

Also, give the processing more weight (unless you really only want this light processing). Heavy computation (ML, for example) should show a difference, but that may not be your use case.

 

From: Sonal Goyal [mailto:sonalgoyal4@gmail.com]
Sent: Thursday, September 28, 2017 4:30 AM
To: Tejeshwar J1 <tejeshwar.j1@globallogic.com.invalid>
Cc: Jeroen Miller <bluedasyatis@gmail.com>; user@spark.apache.org
Subject: Re: More instances = slower Spark job

 

Also check whether the compression algorithm you use is splittable.
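This is often the culprit with compressed files on S3: a non-splittable file is read as a single partition, so one core does all the work no matter how many executors you add. A rough lookup, as a sketch (simplified; container formats such as Parquet or ORC handle splitting internally regardless of codec):

```python
# Rough splittability of common codecs for Spark/Hadoop text input.
SPLITTABLE = {
    ".gz": False,     # gzip: whole file becomes a single partition
    ".bz2": True,     # bzip2: splittable, but slow to decompress
    ".lzo": True,     # splittable only when an index has been built
    ".snappy": False, # raw snappy streams are not splittable
}

def is_splittable(path):
    for ext, splittable in SPLITTABLE.items():
        if path.endswith(ext):
            return splittable
    return True  # assume uncompressed text

print(is_splittable("s3://bucket/data.txt.gz"))  # prints False
```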


Thanks,
Sonal
Nube Technologies

On Thu, Sep 28, 2017 at 2:17 PM, Tejeshwar J1 <tejeshwar.j1@globallogic.com.invalid> wrote:

Hi Miller,

 

Try using:

1. coalesce(numberOfPartitions) to reduce the number of partitions and avoid idle cores.

2. Reducing executor memory as you increase the number of executors.

3. Performing GC tuning, or replacing the default Java serialization with Kryo serialization.
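For points 2 and 3, a sketch of the corresponding spark-defaults.conf entries (the sizes and counts are placeholders to tune for your cluster, not recommendations); point 1 is a code-level call, df.coalesce(numberOfPartitions):

```properties
# Placeholder values -- tune for your cluster.
spark.serializer         org.apache.spark.serializer.KryoSerializer
spark.executor.instances 10
spark.executor.memory    4g
spark.executor.cores     4
```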

 

 

Thanks,

Tejeshwar

 

 

From: Jeroen Miller [mailto:bluedasyatis@gmail.com]
Sent: Thursday, September 28, 2017 2:11 PM
To: user@spark.apache.org
Subject: More instances = slower Spark job

 

Hello,

 

I am experiencing a disappointing performance issue with my Spark jobs as I scale up the number of instances.

 

The task is trivial: I am loading large (compressed) text files from S3, filtering out lines that do not match a regex, counting the number of remaining lines, and saving the resulting datasets as (compressed) text files on S3. Nothing that a simple grep couldn't do, except that the files are too large to be downloaded and processed locally.
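The shape of that job, in plain Python for illustration (the regex and data are placeholders; the Spark equivalent is noted in the comments):

```python
import re

# Stand-in for the job described above: keep lines matching a
# regex and count the survivors. In Spark this would be roughly
# sc.textFile(inPath).filter(pattern.search).count() followed by
# saveAsTextFile(outPath) -- paths and pattern are hypothetical.
pattern = re.compile(r"ERROR")

def filter_and_count(lines):
    kept = [line for line in lines if pattern.search(line)]
    return kept, len(kept)

kept, n = filter_and_count(["ERROR a", "INFO b", "ERROR c"])
print(n)  # prints 2
```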

 

On a single instance, I can process X GBs per hour. When scaling up to 10 instances, I noticed that processing the /same/ amount of data actually takes /longer/.

 

This is quite surprising as the task is really simple: I was expecting a significant speed-up. My naive idea was that each executor would process a fraction of the input file, count the remaining lines /locally/, and save its part of the processed file /independently/, so no data shuffling would occur.

 

Obviously, this is not what is happening.

 

Can anyone shed some light on this or provide pointers to relevant information?

 

Regards,

 

Jeroen