spark-user mailing list archives

From Jeroen Miller <bluedasya...@gmail.com>
Subject Re: More instances = slower Spark job
Date Thu, 28 Sep 2017 21:45:52 GMT
On Thu, Sep 28, 2017 at 9:02 PM, Jörn Franke <jornfranke@gmail.com> wrote:
> It looks to me a little bit strange. First, json.gz files are single-threaded, i.e. each
> file can only be processed by one thread (so it is good to have many files of around
> 128 MB to 512 MB each).

Indeed. Unfortunately, the files I have to work with are quite a bit larger.

> Then what you do in the code is already done by the data source. There is no need to
> read the file directory and parallelize. Just provide the directory containing the files
> to the data source and Spark automatically takes care to read them from different
> executors.

Very true. The motivation behind my contrived approach is that I need to
replicate the same file tree structure after filtering -- that does not
seem easy if I build one huge RDD from all the input files.
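
That said, something along these lines might work -- a rough sketch only;
the bucket paths, the "someField" filter and the "source_file" column name
are placeholders:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.input_file_name

  val spark = SparkSession.builder().appName("json-filter").getOrCreate()
  import spark.implicits._

  // Point the JSON data source at the whole directory; Spark distributes
  // the (non-splittable) json.gz files across executors, one file per task.
  val df = spark.read
    .json("s3://bucket/input/")
    .withColumn("source_file", input_file_name()) // keep per-record provenance

  // Hypothetical filter step.
  val filtered = df.filter($"someField" === "keep")

  // Partitioning the output by the originating file (or, in practice, a
  // cleaner key derived from its path) gives back a per-input-file layout
  // without listing the directory and calling parallelize by hand.
  filtered.write
    .partitionBy("source_file")
    .json("s3://bucket/output/")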

> In order to improve write performance, check if you can store them in Avro (or Parquet
> or ORC) using their internal compression feature. Then you can even have many threads
> per file.

Indeed, 50% of my processing time is spent uploading the results to S3.
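
If I switch the output to Parquet, the write would look roughly like this
(again a sketch, reusing the "filtered" DataFrame from above; the codec and
output path are just examples):

  // Snappy-compressed Parquet instead of json.gz: smaller uploads to S3
  // and splittable files for downstream reads.
  filtered.write
    .option("compression", "snappy")
    .parquet("s3://bucket/output-parquet/")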

Thank you for your input.

Jeroen

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

