spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: More instances = slower Spark job
Date Thu, 28 Sep 2017 14:23:58 GMT
Hi,

I would be very surprised if someone told me that a 1 GB CSV text file is
automatically split and read by multiple executors in Spark. It does not
matter whether it is stored in HDFS, S3 or any other system.

And if someone told me that a smaller CSV file, say 100 MB, is split
while being read, that would be surprising as well.

Once Spark has loaded it into its cache, things are of course different.


Regards,
Gourav

On Thu, Sep 28, 2017 at 2:45 PM, ayan guha <guha.ayan@gmail.com> wrote:

> Hi
>
> Can you kindly explain how Spark uses parallelism for a bigger (say 1 GB)
> text file? Does it use InputFormat to create multiple splits and create 1
> partition per split? Also, in the case of S3 or NFS, how does the input
> split work? I understand that on HDFS files are already pre-split into
> blocks, so Spark can use dfs.blocksize to determine partitions. But how
> does it work on anything other than HDFS?
>
> On Thu, Sep 28, 2017 at 11:26 PM, Daniel Siegmann <
> dsiegmann@securityscorecard.io> wrote:
>
>>> no matter what you do and how many nodes you start, in case you have a
>>> single text file, it will not use parallelism.
>>
>> This is not true, unless the file is small or is gzipped (gzipped files
>> cannot be split).
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
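
For reference, the split question in the quoted thread can be worked
through with plain arithmetic. The sketch below is not Spark code. It is
a minimal model of the split rule in Hadoop's classic FileInputFormat,
which sc.textFile relies on, written from my understanding of that rule.
The function name plan_splits and its parameters are made up for this
example, and block_size stands for whatever block size the filesystem
connector reports (for S3 or NFS that is a configured value, not a
physical block). The DataFrame CSV reader sizes its partitions by a
different rule (spark.sql.files.maxPartitionBytes), which is not modelled
here.

```python
# Hypothetical helper: models how a single file is cut into input splits.
# Assumption: split_size = max(min_split_size, min(goal_size, block_size)),
# with the last split allowed to run up to 10% over (the "split slop").

SPLIT_SLOP = 1.1


def plan_splits(file_size, block_size=128 * 1024 * 1024, min_partitions=2,
                min_split_size=1, splittable=True):
    """Return the byte lengths of the input splits for one file."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if not splittable:
        # e.g. a gzipped file: one reader has to consume the whole stream
        return [file_size]

    # Target size if the file were divided evenly among min_partitions
    goal_size = file_size // max(min_partitions, 1)
    split_size = max(min_split_size, min(goal_size, block_size))

    splits = []
    remaining = file_size
    # Emit full-size splits while more than 1.1 split sizes remain
    while remaining / split_size > SPLIT_SLOP:
        splits.append(split_size)
        remaining -= split_size
    # Whatever is left becomes the final split
    if remaining:
        splits.append(remaining)
    return splits


MB = 1024 * 1024
one_gb = plan_splits(1024 * MB)                     # plain 1 GB text file
small = plan_splits(100 * MB)                       # plain 100 MB text file
gzipped = plan_splits(1024 * MB, splittable=False)  # gzipped 1 GB file
print(len(one_gb), len(small), len(gzipped))
```

Under these assumptions each split becomes one partition, so the number
printed for each case is the number of read tasks. On a real cluster the
same thing can be checked by asking the loaded RDD for its partition
count instead of trusting the model.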
