spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: More instances = slower Spark job
Date Thu, 28 Sep 2017 13:45:57 GMT
Hi

Can you kindly explain how Spark uses parallelism for a bigger (say 1 GB) text
file? Does it use the InputFormat to create multiple splits and create one
partition per split? Also, in the case of S3 or NFS, how does the input split
work? I understand that for HDFS, files are already pre-split, so Spark can use
dfs.blocksize to determine partitions. But how does it work for sources other
than HDFS?
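
For reference, a minimal sketch of what I am asking about (the S3 path, the
file size, and the partition hint are just placeholders, not my actual job):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("partition-check").getOrCreate()
  val sc = spark.sparkContext

  // Hypothetical ~1 GB plain-text file on S3. As I understand it, Spark asks
  // the Hadoop InputFormat for splits and makes roughly one partition per split.
  val rdd = sc.textFile("s3a://my-bucket/big-file.txt")
  println(s"default partitions: ${rdd.getNumPartitions}")

  // The second argument is a hint for a minimum number of splits.
  val hinted = sc.textFile("s3a://my-bucket/big-file.txt", minPartitions = 64)
  println(s"with minPartitions hint: ${hinted.getNumPartitions}")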

On Thu, Sep 28, 2017 at 11:26 PM, Daniel Siegmann
<dsiegmann@securityscorecard.io> wrote:

>
> no matter what you do and how many nodes you start, in case you have a
>> single text file, it will not use parallelism.
>>
>
> This is not true, unless the file is small or is gzipped (gzipped files
> cannot be split).
>
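
(On the gzip point above: a quick way to see the difference, with hypothetical
file names, is to compare partition counts for a plain and a gzipped copy of
the same data:)

  // Assuming an existing SparkContext `sc` and hypothetical paths:
  val plain = sc.textFile("s3a://my-bucket/data.txt")
  val gz    = sc.textFile("s3a://my-bucket/data.txt.gz")

  // A large plain-text file is split into many partitions, while the gzipped
  // copy comes back as a single partition, since gzip streams cannot be split
  // for parallel reads.
  println(s"plain: ${plain.getNumPartitions}, gzipped: ${gz.getNumPartitions}")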



-- 
Best Regards,
Ayan Guha
