You could look at using Cassandra for storage. Spark integrates nicely with Cassandra, and the combination would give you fast access to structured data in Cassandra while enabling analytic scenarios via Spark. Cassandra would take care of the replication, which is one of the core features of the database.
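To make the replication point concrete: in Cassandra the replication factor is set per keyspace, and setting it equal to the number of nodes in a single-datacenter cluster means every node holds a replica of every row. Below is a minimal sketch that only builds the CQL statement; the helper name, the keyspace name "analytics", and the node count of 4 are made up for illustration, and SimpleStrategy is assumed because the question describes a single cluster.

```python
import re


def full_replication_keyspace_cql(keyspace: str, node_count: int) -> str:
    """Build a CREATE KEYSPACE statement whose replication factor equals
    the number of nodes, so each node stores a full copy of the data.

    Hypothetical helper for illustration; it only formats a string and
    does not talk to a cluster.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Rough guard so an arbitrary string is not spliced into the statement.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", keyspace):
        raise ValueError("keyspace must be alphanumeric/underscore")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        "WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {node_count}}};"
    )


if __name__ == "__main__":
    # Assumed example values: a 4-node cluster and a keyspace named "analytics".
    print(full_replication_keyspace_cql("analytics", 4))
```

The resulting statement can be run from cqlsh or any driver. Note that the replication factor would need to be raised if nodes are added later, and that disk usage grows with the node count since each node stores the whole data set.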
Date: Sat, 24 Jan 2015 23:34:15 +0200
Subject: Full per node replication level (architecture question)
From: firstname.lastname@example.org
To: email@example.com
I wonder whether any of the file systems supported by Spark supports a replication level whereby each node has a full copy of the data.
I realize this was not the main intended scenario of Spark/Hadoop, but it may be a good fit for a compute cluster that needs to be very fast over its input data and that holds only a few terabytes in total (which fits nicely on any commodity disk, and soon on any SSD).
It would be nice to use Spark map-reduce over the data, and enjoy automatic replication.
It would also be nice to assume Spark can seamlessly manage a job's workflow across such a cluster...