spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steve Loughran <>
Subject Re: More instances = slower Spark job
Date Sun, 01 Oct 2017 22:53:53 GMT

> On 28 Sep 2017, at 14:45, ayan guha <> wrote:
> Hi
> Can you kindly explain how Spark uses parallelism for bigger (say 1GB) text file? Does
it use InputFormat do create multiple splits and creates 1 partition per split?

Yes, Input formats give you their splits, this is usually used to decide how to break things
up, As to how that gets used: you'll have to look at the source as I'll only get it wrong.
Key point: it's part of the information which can be used to partition the work, but the number
of available workers is the other big factor.

> Also, in case of S3 or NFS, how does the input split work? I understand for HDFS files
are already pre-split so Spark can use dfs.blocksize to determine partitions. But how does
it work other than HDFS?

there's invariably a config option to allow you tell spark what blocksize to work with, e.g
fs.s3a.block.size ., which you set in spark defaults to something like

spark.hadoop.fs.s3a.block.size 67108864

to set it to 64MB. 

HDFS also provides locality information: where the data is. Other filesytems don't do that,
they usually just say "localhost", which Spark recognises as "anywhere" schedules work
on different parts of a file wherever there is free capacity.

To unsubscribe e-mail:

View raw message