spark-user mailing list archives

From Alexander Czech <alexander.cz...@googlemail.com>
Subject Re: HDFS or NFS as a cache?
Date Fri, 29 Sep 2017 14:59:22 GMT
Yes, I have identified the rename as the problem, which is why I think the
extra bandwidth of the larger instances might not help. There is also a
consistency issue with S3 because of how the rename works, so I will
probably lose data.
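
For reference, here is a minimal sketch of the staging idea from my original
question: write the parquet output to HDFS first, then copy the finished
directory to S3 as a separate step. It assumes HDFS is running on the cluster
and that the `hadoop` CLI is on the PATH. The staging path and the function
names are hypothetical, and I have not measured whether this is actually
faster.

```python
import subprocess

# Hypothetical staging location; it does not appear anywhere in this thread.
STAGING_DIR = "hdfs:///tmp/parquet_staging"
# Final destination, as in the original write call.
FINAL_DIR = "s3a://bucket/parquet"


def write_staged(df, staging_dir=STAGING_DIR):
    """Write the DataFrame to HDFS, so the committer renames files on HDFS."""
    df.write.parquet(staging_dir)


def distcp_command(src=STAGING_DIR, dst=FINAL_DIR):
    """Build the `hadoop distcp` invocation that copies the finished output."""
    return ["hadoop", "distcp", src, dst]


def publish(src=STAGING_DIR, dst=FINAL_DIR):
    """Run the copy; raises CalledProcessError if distcp exits non-zero."""
    subprocess.run(distcp_command(src, dst), check=True)
```

The job would call write_staged(df) and, once that has committed, publish().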

On Fri, Sep 29, 2017 at 4:42 PM, Vadim Semenov <vadim.semenov@datadoghq.com>
wrote:

> How many files do you produce? I believe it spends a lot of time renaming
> the files because of the output committer.
> Also, instead of 5x c3.2xlarge, try 2x c3.8xlarge, because they have
> 10GbE and you can get good throughput to S3.
>
> On Fri, Sep 29, 2017 at 9:15 AM, Alexander Czech <
> alexander.czech@googlemail.com> wrote:
>
>> I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write
>> parquet files to S3. But S3 performance is bad, for various reasons, when
>> I write to S3 through the parquet write method:
>>
>> df.write.parquet('s3a://bucket/parquet')
>>
>> Now I want to set up a small cache for the parquet output. One output is
>> about 12-15 GB in size. Would it be enough to set up an NFS directory on
>> the master, write the output to it, and then move it to S3? Or should I
>> set up HDFS on the master? Or should I even opt for an additional cluster
>> running HDFS on more than one node?
>>
>> thanks!
>>
>
>
