spark-user mailing list archives

From JG Perrin
Subject RE: HDFS or NFS as a cache?
Date Fri, 29 Sep 2017 19:03:22 GMT
You will collect the data in the driver (often the master), and the driver will save it, so
you will not have to set up HDFS just for saving.

From: Alexander Czech
Sent: Friday, September 29, 2017 8:15 AM
Subject: HDFS or NFS as a cache?

I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write parquet files to S3.
But the S3 performance is bad, for various reasons, when I access S3 through the parquet write

Now I want to set up a small cache for the parquet output. One output is about 12-15 GB in
size. Would it be enough to set up an NFS directory on the master, write the output to it and
then move it to S3? Or should I set up HDFS on the master? Or should I even opt for an additional
cluster running an HDFS solution on more than one node?
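
The "write to a staging directory, then move it to S3" idea in the question can be sketched
as below. This is only an illustration under assumptions, not something taken from the
thread: the names move_staged_output and copy_upload are hypothetical, and the uploader
here copies into a local directory so the sketch runs without AWS credentials. In a real
setup the uploader would wrap an S3 client call (for example boto3's upload_file), and the
staging directory would have to be visible to whichever machine runs the move, which is
what a shared NFS mount would provide.

```python
import shutil
import tempfile
from functools import partial
from pathlib import Path
from typing import Callable, List


def copy_upload(local_file: Path, key: str, dest_root: Path) -> None:
    """Hypothetical stand-in for an S3 upload: copy the file under dest_root/key."""
    target = dest_root / key
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(local_file, target)


def move_staged_output(staging_dir: Path,
                       upload: Callable[[Path, str], None]) -> List[str]:
    """Upload every file under staging_dir, then remove the staging copy.

    Returns the relative keys that were uploaded, in sorted order.
    """
    keys = []
    files = sorted(p for p in staging_dir.rglob("*") if p.is_file())
    for path in files:
        # Use the path relative to the staging root as the object key.
        key = path.relative_to(staging_dir).as_posix()
        upload(path, key)
        keys.append(key)
    # Delete the staged data only after every upload call has returned.
    shutil.rmtree(staging_dir)
    return keys


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    staging = root / "staging"
    staging.mkdir()
    # Fake part file standing in for a parquet output written by the job.
    (staging / "part-00000.parquet").write_bytes(b"demo")
    uploaded = move_staged_output(staging, partial(copy_upload, dest_root=root / "dest"))
    print(uploaded)
```

Passing the uploader in as a callable keeps the move logic independent of the storage
backend, so the same loop could be tried against a local directory first and an S3 client
later.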