spark-user mailing list archives

From Christopher Nguyen <>
Subject Re: Drawback of Spark memory model as compared to Hadoop?
Date Sun, 13 Oct 2013 08:34:28 GMT

> - node failure?
> - not able to handle if intermediate data > memory size of a node
> - cost

Spark uses lineage-based recomputation (journaling the transformations rather than the data) to provide resiliency in case of node failure, thus providing node-level recovery much as Hadoop MapReduce does, except with much faster recovery in most cases. Spark is also designed to spill to disk if a given node doesn't have enough RAM to hold its data partitions, thus degrading gracefully to disk-based data handling. As for cost: at street prices of about $5/GB for RAM, meaning you can have up to 1TB of RAM for about $5K, memory is becoming a smaller fraction of total node cost.
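The recomputation idea can be sketched in plain Python. This is a toy illustration of the lineage concept, not Spark's actual API: each dataset remembers its parent and the transformation that produced it, so a lost in-memory result can be rebuilt from its lineage rather than restored from a replicated copy.

```python
# Toy sketch of lineage-based recovery (illustration only; not Spark's API).
class LineageDataset:
    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data, if this is a root dataset
        self.parent = parent        # upstream dataset in the lineage chain
        self.transform = transform  # function applied to the parent's output
        self.cache = None           # in-memory result; may be lost on failure

    def compute(self):
        # Serve from memory when possible; otherwise recompute from lineage.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = [self.transform(x) for x in self.parent.compute()]
        return self.cache

    def map(self, f):
        # Record the transformation lazily; nothing is computed yet.
        return LineageDataset(parent=self, transform=f)


base = LineageDataset(source=range(5))
squared = base.map(lambda x: x * x)
print(squared.compute())  # [0, 1, 4, 9, 16]
squared.cache = None      # simulate losing the node holding this result
print(squared.compute())  # rebuilt from lineage: [0, 1, 4, 9, 16]
```

The key point is that only the recipe (parent + transform) needs to survive a failure; the data itself can always be re-derived, which is what makes recovery cheap compared to replicating intermediate data to disk.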

If your larger question is "Hadoop MR or Spark?", or more generally, "disk-based or RAM-based distributed computing?", the correct answer is "it depends." And the variables "it" depends on are themselves changing over time.
A way to think about this is to see that there is a cost-benefit crossover point for every unique organization/business-use-case combination, before which disk is preferred, and beyond which RAM is preferred. For many Wall Street mission-critical apps, where milliseconds can mean millions of dollars, many of these crossover points were passed in the mid-2000s. At Google, a large organization with large datasets and high productivity ($1.2M/employee-year), you can see similar crossovers in the late 2000s/early 2010s (cf. PowerDrill). The general industry is undergoing similar evaluations.

The next question to ask is "how are the underlying variables changing?" Consider, for example, how latencies are evolving across the technologies in your compute path, even as each gets cheaper per Moore's Law. For RAM outside the L1/L2 caches, we're in the 60ns regime, going down to 30-40ns. Network latencies are around 100us, going down to the 10us range. In contrast, disk latencies have bottomed out at 4-5ms, and the trend for SSD reads is actually upward, from 20us to 30-40us (to achieve higher densities). You could do similar projections for bandwidths. Certainly, these storage technologies have their place, but the point is that whatever your cost-benefit equation for in-memory vs disk-based use cases is this year, next year it will shift more in favor of memory, and inexorably so the year after that.
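The ratios matter more than the absolute values here. A rough back-of-the-envelope using the latency figures quoted above (approximate, and the exact numbers will vary by hardware generation) shows RAM staying several orders of magnitude ahead of disk:

```python
# Rough latency comparison using the approximate figures quoted above.
NS = 1
US = 1_000 * NS
MS = 1_000 * US

latencies_ns = {
    "RAM (beyond L1/L2)": 60 * NS,   # trending toward 30-40 ns
    "network round trip": 100 * US,  # trending toward ~10 us
    "SSD read": 20 * US,             # trending up toward 30-40 us
    "disk seek": 4 * MS,             # bottomed out around 4-5 ms
}

ram = latencies_ns["RAM (beyond L1/L2)"]
for name, ns in latencies_ns.items():
    print(f"{name}: {ns} ns ({ns // ram}x RAM)")

# At these figures, disk is on the order of 66,000x slower than RAM.
print(latencies_ns["disk seek"] // ram)  # 66666
```

So even a modest amount of recomputation or spill can be a bargain relative to making disk part of the steady-state data path.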

So trends clearly favor in-memory techniques like Spark. These industry
trends have reinforcing positive feedback: as more organizations adopt
in-memory technologies, it will become uncompetitive for laggards to sit on
the sidelines for the same use cases. A final thing to keep in mind is that
having affordable high performance enables use cases that were not at all
possible before, such as interactive data science with huge datasets.

Christopher T. Nguyen
Co-founder & CEO, Adatao <>

On Sat, Oct 12, 2013 at 8:24 PM, howard chen <> wrote:

> Hello,
> I am new to Spark and have only used Hadoop in the past.
> I understand Spark is in-memory, as compared to Hadoop, which uses disk
> for intermediate storage. In practical terms, the benefit must be
> performance, but what would be the drawbacks?
> e.g.
> - node failure?
> - not able to handle if intermediate data > memory size of a node
> - cost
> I would like to hear your experience using Spark to handle big data,
> and what the workarounds are in the above cases.
> Thanks.
