spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: compare/contrast Spark with Cascading
Date Tue, 29 Oct 2013 00:51:49 GMT
FWIW, the only thing that Spark expects to fit in memory if you use DISK_ONLY caching is the
input to each reduce task. Those currently don't spill to disk. The solution if datasets are
large is to add more reduce tasks, whereas Hadoop would run along with a small number of tasks
that do lots of disk IO. But this is something we will likely change soon. Other than that,
everything runs in a streaming fashion and there's no need for the data to fit in memory.
Our goal is certainly to work on any size datasets, and some of our current users are explicitly
using Spark to replace things like Hadoop Streaming in just batch jobs (see e.g. Yahoo!'s
presentation from http://ampcamp.berkeley.edu/3/). If you run into trouble with these, let
us know, since it is an explicit goal of the project to support it.

Matei
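
The point about adding more reduce tasks can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark code: it assumes records are hash-partitioned by key, and the helper name largest_reduce_input and the workload are hypothetical. In Spark itself, the corresponding knob is the number of partitions passed to a shuffle operation such as reduceByKey.

```python
from collections import Counter


def largest_reduce_input(keys, num_reduce_tasks):
    """Return the record count of the largest reduce-task input when
    records are hash-partitioned by key across num_reduce_tasks tasks."""
    sizes = Counter(hash(k) % num_reduce_tasks for k in keys)
    return max(sizes.values())


# Hypothetical workload: 1,000,000 records spread over 100,000 integer keys.
keys = [i % 100_000 for i in range(1_000_000)]

# More reduce tasks means each task receives a smaller input,
# which is what has to fit in memory in the scenario described above.
for n in (8, 64, 512):
    print(n, largest_reduce_input(keys, n))
```

With evenly distributed keys, the largest input shrinks roughly in proportion to the number of reduce tasks. A heavily skewed key would not benefit the same way, since all of its records still land in one task.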

On Oct 28, 2013, at 5:32 PM, Koert Kuipers <koert@tresata.com> wrote:

> no problem :) i am actually not familiar with what oscar has said on this. can you share
or point me to the conversation thread?
> 
> it is my opinion based on the little experimenting i have done. but i am willing to be
convinced otherwise.
> one of the very first things i did when we started using spark is run jobs with DISK_ONLY,
and see if it could do some of the jobs that map-reduce does for us. however i ran into OOMs,
presumably because spark makes assumptions that some things should fit in memory. i have to
admit i didn't try too hard after the first OOMs.
> 
> if spark were able to scale from the quick in-memory query to the overnight disk-only
giant batch query, i would love it! spark has a much nicer api than map-red, and one could
use a single set of algos for everything from quick/realtime queries to giant batch jobs.
as far as i am concerned map-red would be done. our clusters of the future would be hdfs +
spark.
> 
> 
> On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
> And I didn't mean to skip over you, Koert.  I'm just more familiar with what Oscar said
on the subject than with your opinion.
> 
> 
> 
> On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
> Hmmm... I was unaware of this concept that Spark is for medium to large datasets but
not for very large datasets.
>  
> It is the opinion of some at Twitter.  That doesn't make it true or a universally
held opinion.
> 
> 
> 
> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole <arangole@gmail.com> wrote:
> Hmmm... I was unaware of this concept that Spark is for medium to large datasets but
not for very large datasets. What size is very large?
> 
> Can someone please elaborate on why this would be the case and what stops Spark, as it
is today, from being successfully run on very large datasets? I'd appreciate it.
> 
> I would think that Spark should be able to pull off Hadoop level throughput in worst
case with DISK_ONLY caching.
> 
> Thanks
> 
> On Oct 28, 2013 1:37 PM, "Koert Kuipers" <koert@tresata.com> wrote:
> i would say scalding (cascading + DSL for scala) offers similar functionality to spark,
and a similar syntax. 
> the main difference between spark and scalding is target jobs: 
> scalding is for long running jobs on very large data. the data is read from and written
to disk between steps. jobs run from minutes to days.
> spark is for faster jobs on medium to large data. the data is primarily held in memory.
jobs run from a few seconds to a few hours. although spark can work with data on disks it
still makes assumptions that data needs to fit in memory for certain steps (although less
and less with every release). spark also makes iterative designs much easier.
> 
> i have found them both great to program in and complementary. we use scalding for overnight
batch processes and spark for more realtime processes. at this point i would trust scalding
a lot more due to the robustness of the stack, but spark is getting better every day.
> 
> 
> 
> 
> On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan <ceteri@gmail.com> wrote:
> Hi Philip,
> 
> Cascading is relatively agnostic about the distributed topology underneath it, especially
as of the 2.0 release over a year ago. There's been some discussion about writing a flow planner
for Spark -- e.g., which would replace the Hadoop flow planner. Not sure if there's active
work on that yet.
> 
> There are a few commercial workflow abstraction layers (probably what was meant by "application
layer" ?), in terms of the Cascading family (incl. Cascalog, Scalding), and also Actian's
integration of Hadoop/Knime/etc., and also the work by Continuum, ODG, and others in the Py
data stack.
> 
> Spark would not be at the same level of abstraction as Cascading (business logic, effectively);
however, something like MLbase is ostensibly intended for that http://www.mlbase.org/
> 
> With respect to Spark, two other things to watch... One would definitely be the Py data
stack and ability to integrate with PySpark, which is turning out to be a very powerful abstraction
-- quite close to a large segment of industry needs.  The other project to watch, on the Scala
side, is Summingbird and its evolution at Twitter: https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
> 
> Paco
> http://amazon.com/dp/1449358721/
> 
> 
> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren <philip.ogren@oracle.com> wrote:
> 
> My team is investigating a number of technologies in the Big Data space.  A team member
recently got turned on to Cascading as an application layer for orchestrating complex workflows/scenarios.
 He asked me if Spark had an "application layer"?  My initial reaction is "no" that Spark
would not have a separate orchestration/application layer.  Instead, the core Spark API (along
with Streaming) would compete directly with Cascading for this kind of functionality and that
the two would not likely be all that complementary.  I realize that I am exposing my ignorance
here and could be way off.  Is there anyone who knows a bit about both of these technologies
who could speak to this in broad strokes?  
> 
> Thanks!
> Philip
> 
> 
> 
> 
> 
> 

