spark-user mailing list archives

From: Mark Hamstra <m...@clearstorydata.com>
Subject: Re: compare/contrast Spark with Cascading
Date: Mon, 28 Oct 2013 21:36:39 GMT
>
> 1) when you say "Cascading is relatively agnostic about the distributed
> topology underneath it" I take that as a hedge that suggests that while it
> could be possible to run Spark underneath Cascading this is not something
> commonly done or would necessarily be straightforward.  Is this an unfair
> reading between the lines - or is Cascading-on-top-of-Spark an established
> technology stack that people are actually using?


Not yet an established technology AFAIK, but I have heard Oscar mention the
possibility of Scalding in the future being able to shift gears, as it were
-- handling flows against very large datasets using Hadoop MR, but then
transparently shifting to Spark on the backend once a relevant subset of the
data has been reduced/extracted that is small enough to fit into the
aggregate memory of an available Spark cluster.

I think Paco's point isn't that such things are easy to do right now so much
as that the underlying architecture of Cascading/Scalding is generic or
abstract enough to make them quite conceivable.
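
To make that a bit more concrete, here's a rough sketch (from memory, not
tested -- the paths, job names, and arguments are just placeholders) of
roughly the same word count written against the Scalding fields-based API and
against the Spark Scala API. At the level of user code the two look quite
similar; the difference Paco is pointing at is what happens underneath --
Cascading/Scalding hands the assembled flow to a planner (today, the Hadoop
MR planner; a Spark flow planner is what would let the same job target
Spark), whereas Spark executes the RDD operations directly on its own
cluster runtime:

    // Scalding: the job is assembled as an abstract flow and handed to
    // Cascading's planner for execution
    import com.twitter.scalding._

    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }

    // Spark: the RDD operations are executed directly by Spark's runtime
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("local", "WordCount")
    sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("counts")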



On Mon, Oct 28, 2013 at 2:20 PM, Philip Ogren <philip.ogren@oracle.com> wrote:

>  Hi Paco,
>
> Thank you for the various links and thoughts.  Yes - "workflow abstraction
> layer" is a better term for what I meant.  I have two questions for you:
>
> 1) when you say "Cascading is relatively agnostic about the distributed
> topology underneath it" I take that as a hedge that suggests that while it
> could be possible to run Spark underneath Cascading this is not something
> commonly done or would necessarily be straightforward.  Is this an unfair
> reading between the lines - or is Cascading-on-top-of-Spark an established
> technology stack that people are actually using?
>
> 2) Can you give an example of how Cascading is at a higher level of
> abstraction than Spark?  When I look at the landing page for Scalding
> (which runs on top of Cascading) and JCascalog (which claims to be yet another
> level of abstraction above Cascading) I see getting started code snippets
> that look exactly like the sort of thing you do with Spark.  I can
> understand why this is a useful approach for a getting started page but it
> doesn't shed light on how these two technologies might differentiate themselves from
> Spark with respect to the abstraction layer they target.  Any thoughts on
> this (or examples!) would be helpful to me.
>
> Thanks,
> Philip
>
>
>
> On 10/28/2013 1:00 PM, Paco Nathan wrote:
>
> Hi Philip,
>
>  Cascading is relatively agnostic about the distributed topology
> underneath it, especially as of the 2.0 release over a year ago. There's
> been some discussion about writing a flow planner for Spark -- e.g., one
> that would replace the Hadoop flow planner. Not sure if there's active work
> on that yet.
>
>  There are a few commercial workflow abstraction layers (probably what
> was meant by "application layer"?), in terms of the Cascading family
> (incl. Cascalog, Scalding), and also Actian's integration of
> Hadoop/Knime/etc., and also the work by Continuum, ODG, and others in the
> Py data stack.
>
>  Spark would not be at the same level of abstraction as Cascading
> (business logic, effectively); however, something like MLbase is ostensibly
> intended for that: http://www.mlbase.org/
>
>  With respect to Spark, two other things to watch... One would definitely
> be the Py data stack and ability to integrate with PySpark, which is
> turning out to be a very powerful abstraction -- quite close to a large segment
> of industry needs.  The other project to watch, on the Scala side, is
> Summingbird and its evolution at Twitter:
> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>
>  Paco
> http://amazon.com/dp/1449358721/
>
>
> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren <philip.ogren@oracle.com> wrote:
>
>>
>> My team is investigating a number of technologies in the Big Data space.
>> A team member recently got turned on to Cascading
>> <http://www.cascading.org/about-cascading/> as an application layer for
>> orchestrating complex workflows/scenarios.  He asked me if Spark had an
>> "application layer."  My initial reaction is "no" -- that Spark would not
>> have a separate orchestration/application layer.
>> Instead, the core Spark API (along with Streaming) would compete directly
>> with Cascading for this kind of functionality, and the two would not
>> likely be all that complementary.  I realize that I am exposing my
>> ignorance here and could be way off.  Is there anyone who knows a bit about
>> both of these technologies who could speak to this in broad strokes?
>>
>> Thanks!
>> Philip
>>
>>
>
>
