spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From JG Perrin <jper...@lumeris.com>
Subject RE: HDFS or NFS as a cache?
Date Mon, 02 Oct 2017 18:26:36 GMT
Steve,

If I refer to the collect() API, it says “Running collect requires moving all the data into
the application's driver process, and doing so on a very large dataset can crash the driver
process with OutOfMemoryError.” So why would you need a distributed FS?

jg

From: Steve Loughran [mailto:stevel@hortonworks.com]
Sent: Saturday, September 30, 2017 6:10 AM
To: JG Perrin <jperrin@lumeris.com>
Cc: Alexander Czech <alexander.czech@googlemail.com>; user@spark.apache.org
Subject: Re: HDFS or NFS as a cache?


On 29 Sep 2017, at 20:03, JG Perrin <jperrin@lumeris.com<mailto:jperrin@lumeris.com>>
wrote:

You will collect in the driver (often the master) and it will save the data, so for saving,
you will not have to set up HDFS.

no, it doesn't work quite like that.

1. workers generate their data and save somwhere
2. on "task commit" they move their data to some location where it will be visible for "job
commit" (rename, upload, whatever)
3. job commit —which is done in the driver,— takes all the committed task data and makes
it visible in the destination directory.
4. Then they create a _SUCCESS file to say "done!"


This is done with Spark talking between workers and drivers to guarantee that only one task
working on a specific part of the data commits their work, only
committing the job once all tasks have finished

The v1 mapreduce committer implements (2) by moving files under a job attempt dir, and (3)
by moving it from the job attempt dir to the destination. one rename per task commit, another
rename of every file on job commit. In HFDS, Azure wasb and other stores with an O(1) atomic
rename, this isn't *too* expensve, though that final job commit rename still takes time to
list and move lots of files

The v2 committer implements (2) by renaming to the destination directory and (3) as a no-op.
Rename in the tasks then, but not not that second, serialized one at the end

There's no copy of data from workers to driver, instead you need a shared output filesystem
so that the job committer can do its work alongside the tasks.

There are alternatives committer agorithms,

1. look at Ryan Blue's talk: https://www.youtube.com/watch?v=BgHrff5yAQo
2. IBM Stocator paper (https://arxiv.org/abs/1709.01812) and code (https://github.com/SparkTC/stocator/)
3. Ongoing work in Hadoop itself for better committers. Goal: year end & Hadoop 3.1 https://issues.apache.org/jira/browse/HADOOP-13786
. The oode is all there, Parquet is a troublespot, and more testing is welcome from anyone
who wants to help.
4. Databricks have "something"; specifics aren't covered, but I assume its dynamo DB based


-Steve







From: Alexander Czech [mailto:alexander.czech@googlemail.com]
Sent: Friday, September 29, 2017 8:15 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: HDFS or NFS as a cache?

I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write parquet files to S3.
But the S3 performance for various reasons is bad when I access s3 through the parquet write
method:

df.write.parquet('s3a://bucket/parquet')
Now I want to setup a small cache for the parquet output. One output is about 12-15 GB in
size. Would it be enough to setup a NFS-directory on the master, write the output to it and
then move it to S3? Or should I setup a HDFS on the Master? Or should I even opt for an additional
cluster running a HDFS solution on more than one node?
thanks!

Mime
View raw message