spark-user mailing list archives

From Alexander Czech <alexander.cz...@googlemail.com>
Subject HDFS or NFS as a cache?
Date Fri, 29 Sep 2017 13:15:21 GMT
I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write
parquet files to S3. But for various reasons S3 performance is poor when
I access it directly through the parquet write method:

df.write.parquet('s3a://bucket/parquet')

Now I want to set up a small cache for the parquet output. One output is
about 12-15 GB in size. Would it be enough to set up an NFS directory on the
master, write the output to it and then move it to S3? Or should I set up
HDFS on the master? Or should I even opt for an additional cluster running
an HDFS solution on more than one node?
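For concreteness, here is a rough sketch of the "write locally, then copy to
S3" idea I have in mind (paths and bucket names are placeholders, and
hadoop distcp is just one possible way to do the bulk copy):

import subprocess

local_path = 'hdfs:///tmp/parquet_cache'   # or 'file:///mnt/nfs/parquet_cache'
s3_path = 's3a://bucket/parquet'

# 1. Write the ~12-15 GB output to the fast local store first.
df.write.parquet(local_path)

# 2. Copy the finished files to S3 in one bulk transfer.
subprocess.run(['hadoop', 'distcp', local_path, s3_path], check=True)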

thanks!
