spark-user mailing list archives

From Koert Kuipers <>
Subject Re: compare/contrast Spark with Cascading
Date Mon, 28 Oct 2013 19:37:13 GMT
i would say scalding (cascading + DSL for scala) offers similar
functionality to spark, and a similar syntax.
the main difference between spark and scalding is target jobs:
scalding is for long running jobs on very large data. the data is read from
and written to disk between steps. jobs run from minutes to days.
spark is for faster jobs on medium to large data. the data is primarily
held in memory. jobs run from a few seconds to a few hours. although spark
can work with data on disks it still makes assumptions that data needs to
fit in memory for certain steps (although less and less with every
release). spark also makes iterative designs much easier.

i have found them both great to program in and complementary. we use
scalding for overnight batch processes and spark for more realtime
processes. at this point i would trust scalding a lot more due to the
robustness of the stack, but spark is getting better every day.

On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan <> wrote:

> Hi Philip,
> Cascading is relatively agnostic about the distributed topology underneath
> it, especially as of the 2.0 release over a year ago. There's been some
> discussion about writing a flow planner for Spark -- e.g., which would
> replace the Hadoop flow planner. Not sure if there's active work on that
> yet.
> There are a few commercial workflow abstraction layers (probably what was
> meant by "application layer" ?), in terms of the Cascading family (incl.
> Cascalog, Scalding), and also Actian's integration of Hadoop/Knime/etc.,
> and also the work by Continuum, ODG, and others in the Py data stack.
> Spark would not be at the same level of abstraction as Cascading (business
> logic, effectively); however, something like MLbase is ostensibly intended
> for that.
> With respect to Spark, two other things to watch... One would definitely
> be the Py data stack and ability to integrate with PySpark, which is
> turning out to be a very powerful abstraction -- quite close to a large
> segment of industry needs.  The other project to watch, on the Scala side,
> is Summingbird and its evolution at Twitter:
> Paco
> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren <> wrote:
>> My team is investigating a number of technologies in the Big Data space.
>> A team member recently got turned on to Cascading as
>> an application layer for orchestrating complex workflows/scenarios.  He
>> asked me if Spark had an "application layer"?  My initial reaction is "no"
>> that Spark would not have a separate orchestration/application layer.
>> Instead, the core Spark API (along with Streaming) would compete directly
>> with Cascading for this kind of functionality and that the two would not
>> likely be all that complementary.  I realize that I am exposing my
>> ignorance here and could be way off.  Is there anyone who knows a bit about
>> both of these technologies who could speak to this in broad strokes?
>> Thanks!
>> Philip
