spark-user mailing list archives

From Daniel Siegmann <dsiegm...@securityscorecard.io>
Subject Re: More instances = slower Spark job
Date Thu, 28 Sep 2017 14:32:40 GMT
On Thu, Sep 28, 2017 at 7:23 AM, Gourav Sengupta <gourav.sengupta@gmail.com>
wrote:

>
> I will be very surprised if someone tells me that a 1 GB CSV text file is
> automatically split and read by multiple executors in SPARK. It does not
> matter whether it stays in HDFS, S3 or any other system.
>

I can't speak to *every* system, but I can confirm that on HDFS, S3, and
local filesystems a 1 GB CSV file would be split.


>
> Now if someone tells me that in case I have a smaller CSV file of 100MB
> size and that will be split while being read, that will also be surprising.
>

I'm not sure what the default split size is. It typically follows the HDFS
block size, commonly 128 MB, in which case a 100 MB file would not be split.
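As a rough sketch of the arithmetic (assuming Hadoop's usual split-size
formula, max(minSize, min(maxSize, blockSize)), and a 128 MB block size --
the function name and defaults here are illustrative, not an actual Spark
API):

```python
import math

def num_splits(file_size, block_size=128 * 1024 * 1024,
               min_size=1, max_size=None):
    """Estimate how many input splits a splittable file yields,
    using Hadoop's formula: max(minSize, min(maxSize, blockSize))."""
    if max_size is None:
        max_size = block_size
    split_size = max(min_size, min(max_size, block_size))
    return max(1, math.ceil(file_size / split_size))

GB = 1024 ** 3
MB = 1024 ** 2

print(num_splits(1 * GB))    # 1 GB file -> 8 splits of 128 MB each
print(num_splits(100 * MB))  # 100 MB file -> 1 split (not divided)
```

So under these assumptions the 1 GB file is read by up to 8 tasks in
parallel, while the 100 MB file lands in a single task.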

Keep in mind gzipped files cannot be split. If you have very large text
files and you want to compress them, and they will be more than a few
hundred MB compressed, you should probably use bzip2 instead (which can be
split).
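A minimal sketch of that splittability rule (the extension-to-codec mapping
here is illustrative; gzip is a single compressed stream with no block
markers, while bzip2 compresses in independent blocks that tasks can seek
to):

```python
# Whether common compression formats allow input splitting.
# gzip: one continuous stream, so the whole file must go to one task.
# bzip2: independent compressed blocks, so readers can start mid-file.
SPLITTABLE = {
    ".gz": False,   # gzip: not splittable
    ".bz2": True,   # bzip2: splittable
}

def is_splittable(filename):
    """Return True if a file with this extension can be split across tasks.
    Uncompressed files (no matching extension) are treated as splittable."""
    for ext, ok in SPLITTABLE.items():
        if filename.endswith(ext):
            return ok
    return True

print(is_splittable("logs.csv"))      # True
print(is_splittable("logs.csv.gz"))   # False
print(is_splittable("logs.csv.bz2"))  # True
```

A non-splittable file is still readable, of course; it just means one
executor does all the decompression and reading, which is exactly the
"more instances, no speedup" symptom in this thread.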
