spark-user mailing list archives

From Alexander Czech <>
Subject Re: HDFS or NFS as a cache?
Date Fri, 29 Sep 2017 14:59:22 GMT
Yes, I have identified the rename as the problem, which is why I think the
extra bandwidth of the larger instances might not help. There is also a
consistency issue with S3 because of how the rename works, so I would
probably lose data.

On Fri, Sep 29, 2017 at 4:42 PM, Vadim Semenov <> wrote:

> How many files do you produce? I believe it spends a lot of time renaming
> the files because of the output committer.
> Also, instead of 5x c3.2xlarge, try 2x c3.8xlarge, because they
> have 10GbE and you can get good throughput to S3.
> On Fri, Sep 29, 2017 at 9:15 AM, Alexander Czech <> wrote:
>> I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write
>> parquet files to S3. But for various reasons the S3 performance is bad when
>> I access S3 through the parquet write method:
>> df.write.parquet('s3a://bucket/parquet')
>> Now I want to set up a small cache for the parquet output. One output is
>> about 12-15 GB in size. Would it be enough to set up an NFS directory on the
>> master, write the output to it and then move it to S3? Or should I set up
>> HDFS on the master? Or should I even opt for an additional cluster running
>> HDFS on more than one node?
>> Thanks!
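
The "write to a cache, then move to S3" idea from the original question could look roughly like the sketch below. It is only an illustration under assumptions, not a tested recipe from this thread: the helper names (build_distcp_command, write_via_staging) and the staging path are hypothetical, the staging filesystem is assumed to be HDFS (where a rename is a cheap metadata operation), and the `hadoop` CLI with DistCp is assumed to be on the PATH of the machine running the driver.

```python
import subprocess


def build_distcp_command(staging_uri, dest_uri):
    """Return the argv for copying staging_uri to dest_uri with Hadoop DistCp."""
    return ["hadoop", "distcp", staging_uri, dest_uri]


def write_via_staging(df, staging_uri, dest_uri, run=subprocess.run):
    """Write a DataFrame as Parquet to a staging location, then copy it to S3.

    df          -- a Spark DataFrame (anything exposing df.write.parquet(path))
    staging_uri -- hypothetical staging location, e.g. an HDFS directory
    dest_uri    -- final destination, e.g. 's3a://bucket/parquet'
    run         -- callable used to execute the copy (injectable for testing)
    """
    # Step 1: let Spark commit its output on the staging filesystem, so the
    # output committer's renames never touch S3.
    df.write.parquet(staging_uri)

    # Step 2: copy the already-committed files to S3 in a single pass.
    cmd = build_distcp_command(staging_uri, dest_uri)
    run(cmd, check=True)  # raises CalledProcessError if the copy fails
    return cmd
```

With a real DataFrame this would be called as write_via_staging(df, 'hdfs:///tmp/parquet_staging', 's3a://bucket/parquet'). Whether a single-node HDFS or an NFS directory is enough for 12-15 GB per output depends on the cluster; the sketch only shows the two-step shape of the approach.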
