spark-user mailing list archives

From Prashant Sharma <scrapco...@gmail.com>
Subject Re: compare/contrast Spark with Cascading
Date Tue, 29 Oct 2013 05:12:21 GMT
Hey Koert,

Can you give me steps to reproduce this?


On Tue, Oct 29, 2013 at 10:06 AM, Koert Kuipers <koert@tresata.com> wrote:

> Matei,
> We have some jobs where even the input for a single key in a groupBy would
> not fit in the task's memory. We rely on mapred to stream from disk to
> disk as it reduces.
> I think spark should be able to handle that situation to truly be able to
> claim it can replace map-red (or not?).
> Best, Koert
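>
> (Illustrative aside, not part of the original thread: when the per-key result
> can be computed incrementally, the usual Spark-side mitigation is to avoid
> materializing all of a key's values at once, e.g. reduceByKey instead of
> groupByKey. A minimal sketch with hypothetical names:)
>
>   import org.apache.spark.rdd.RDD
>   import org.apache.spark.SparkContext._   // implicits for pair-RDD operations
>
>   // events is a hypothetical RDD[(String, Long)] of (key, amount) pairs.
>   def perKeyTotals(events: RDD[(String, Long)]): RDD[(String, Long)] =
>     // reduceByKey combines values map-side, so no single key's values
>     // ever have to be held in memory at once (unlike groupByKey).
>     events.reduceByKey(_ + _)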
>
>
> On Mon, Oct 28, 2013 at 8:51 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
>
>> FWIW, the only thing that Spark expects to fit in memory if you use
>> DISK_ONLY caching is the input to each reduce task. Those currently don't
>> spill to disk. The solution if datasets are large is to add more reduce
>> tasks, whereas Hadoop would run along with a small number of tasks that do
>> lots of disk IO. But this is something we will likely change soon. Other
>> than that, everything runs in a streaming fashion and there's no need for
>> the data to fit in memory. Our goal is certainly to work on any size
>> datasets, and some of our current users are explicitly using Spark to
>> replace things like Hadoop Streaming in just batch jobs (see e.g. Yahoo!'s
>> presentation from http://ampcamp.berkeley.edu/3/). If you run into
>> trouble with these, let us know, since it is an explicit goal of the
>> project to support it.
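>>
>> (Illustrative only, not part of the original message: a minimal sketch of
>> the two knobs described above, DISK_ONLY persistence plus a larger number
>> of reduce tasks. The paths and the partition count are hypothetical.)
>>
>>   import org.apache.spark.SparkContext
>>   import org.apache.spark.SparkContext._
>>   import org.apache.spark.storage.StorageLevel
>>
>>   val sc = new SparkContext("local[4]", "disk-only-sketch")
>>
>>   val pairs = sc.textFile("hdfs:///data/events")      // hypothetical input
>>     .map(line => (line.split("\t")(0), 1L))
>>     .persist(StorageLevel.DISK_ONLY)                  // keep blocks on disk only
>>
>>   // More reduce tasks means a smaller input per task, so each task's
>>   // input can fit in memory; 512 is only an illustrative value.
>>   val counts = pairs.reduceByKey(_ + _, 512)
>>   counts.saveAsTextFile("hdfs:///data/event-counts")  // hypothetical output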
>>
>> Matei
>>
>> On Oct 28, 2013, at 5:32 PM, Koert Kuipers <koert@tresata.com> wrote:
>>
>> no problem :) i am actually not familiar with what oscar has said on
>> this. can you share or point me to the conversation thread?
>>
>> it is my opinion based on the little experimenting i have done. but i am
>> willing to be convinced otherwise.
>> one of the very first things i did when we started using spark was to run
>> jobs with DISK_ONLY and see if it could do some of the jobs that map-reduce
>> does for us. however i ran into OOMs, presumably because spark makes
>> assumptions that some things should fit in memory. i have to admit i didn't
>> try too hard after the first OOMs.
>>
>> if spark were able to scale from the quick in-memory query to the
>> overnight disk-only giant batch query, i would love it! spark has a much
>> nicer api than map-red, and one could use a single set of algos for
>> everything from quick/realtime queries to giant batch jobs. as far as i am
>> concerned map-red would be done. our clusters of the future would be hdfs +
>> spark.
>>
>>
>> On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
>>
>>> And I didn't mean to skip over you, Koert.  I'm just more familiar with
>>> what Oscar said on the subject than with your opinion.
>>>
>>>
>>>
>>> On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
>>>
>>>> Hmmm... I was unaware of this concept that Spark is for medium to large
>>>>> datasets but not for very large datasets.
>>>>
>>>>
>>>> It is the opinion of some at Twitter.  That doesn't make it true or
>>>> a universally held opinion.
>>>>
>>>>
>>>>
>>>> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole <arangole@gmail.com> wrote:
>>>>
>>>>> Hmmm... I was unaware of this concept that Spark is for medium to
>>>>> large datasets but not for very large datasets. What size is very large?
>>>>>
>>>>> Can someone please elaborate on why this would be the case and what
>>>>> stops Spark, as it is today, from being successfully run on very large
>>>>> datasets? I'd appreciate it.
>>>>>
>>>>> I would think that Spark should be able to pull off Hadoop-level
>>>>> throughput in the worst case with DISK_ONLY caching.
>>>>>
>>>>> Thanks
>>>>> On Oct 28, 2013 1:37 PM, "Koert Kuipers" <koert@tresata.com> wrote:
>>>>>
>>>>>> i would say scalding (cascading + a DSL for scala) offers similar
>>>>>> functionality to spark, and a similar syntax.
>>>>>> the main difference between spark and scalding is the target jobs:
>>>>>> scalding is for long-running jobs on very large data. the data is
>>>>>> read from and written to disk between steps. jobs run from minutes to
>>>>>> days.
>>>>>> spark is for faster jobs on medium to large data. the data is
>>>>>> primarily held in memory. jobs run from a few seconds to a few hours.
>>>>>> although spark can work with data on disk, it still makes assumptions
>>>>>> that data needs to fit in memory for certain steps (although less and
>>>>>> less with every release). spark also makes iterative designs much easier.
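>>>>>>
>>>>>> (Illustrative aside, not part of the original message: the classic word
>>>>>> count, sketched in both APIs, shows how similar the syntax is. Paths
>>>>>> and field names are hypothetical.)
>>>>>>
>>>>>>   // Scalding: a Cascading flow; data goes through disk between steps.
>>>>>>   import com.twitter.scalding._
>>>>>>
>>>>>>   class WordCountJob(args: Args) extends Job(args) {
>>>>>>     TextLine(args("input"))
>>>>>>       .flatMap('line -> 'word) { line: String => line.split("\\s+") }
>>>>>>       .groupBy('word) { _.size }
>>>>>>       .write(Tsv(args("output")))
>>>>>>   }
>>>>>>
>>>>>>   // Spark: the same computation on an RDD, held in memory by default.
>>>>>>   import org.apache.spark.SparkContext
>>>>>>   import org.apache.spark.SparkContext._
>>>>>>
>>>>>>   val sc = new SparkContext("local", "wordcount-sketch")
>>>>>>   sc.textFile("hdfs:///data/input")
>>>>>>     .flatMap(_.split("\\s+"))
>>>>>>     .map(word => (word, 1L))
>>>>>>     .reduceByKey(_ + _)
>>>>>>     .saveAsTextFile("hdfs:///data/wordcounts")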
>>>>>>
>>>>>> i have found them both great to program in and complementary. we use
>>>>>> scalding for overnight batch processes and spark for more realtime
>>>>>> processes. at this point i would trust scalding a lot more due to the
>>>>>> robustness of the stack, but spark is getting better every day.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan <ceteri@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Philip,
>>>>>>>
>>>>>>> Cascading is relatively agnostic about the distributed topology
>>>>>>> underneath it, especially as of the 2.0 release over a year ago.
>>>>>>> There's been some discussion about writing a flow planner for Spark,
>>>>>>> which would replace the Hadoop flow planner. Not sure if there's
>>>>>>> active work on that yet.
>>>>>>>
>>>>>>> There are a few commercial workflow abstraction layers (probably
>>>>>>> what was meant by "application layer"?), in terms of the Cascading
>>>>>>> family (incl. Cascalog, Scalding), and also Actian's integration of
>>>>>>> Hadoop/Knime/etc., and also the work by Continuum, ODG, and others
>>>>>>> in the Py data stack.
>>>>>>>
>>>>>>> Spark would not be at the same level of abstraction as Cascading
>>>>>>> (business logic, effectively); however, something like MLbase is
>>>>>>> ostensibly intended for that: http://www.mlbase.org/
>>>>>>>
>>>>>>> With respect to Spark, two other things to watch... One would
>>>>>>> definitely be the Py data stack and the ability to integrate with
>>>>>>> PySpark, which is turning out to be a very powerful abstraction --
>>>>>>> quite close to a large segment of industry needs.  The other project
>>>>>>> to watch, on the Scala side, is Summingbird and its evolution at
>>>>>>> Twitter:
>>>>>>> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>>>>>>>
>>>>>>> Paco
>>>>>>> http://amazon.com/dp/1449358721/
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren <philip.ogren@oracle.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> My team is investigating a number of technologies in the Big Data
>>>>>>>> space.  A team member recently got turned on to Cascading
>>>>>>>> <http://www.cascading.org/about-cascading/> as an application layer
>>>>>>>> for orchestrating complex workflows/scenarios.  He asked me if Spark
>>>>>>>> had an "application layer"?  My initial reaction is "no", that Spark
>>>>>>>> would not have a separate orchestration/application layer.  Instead,
>>>>>>>> the core Spark API (along with Streaming) would compete directly
>>>>>>>> with Cascading for this kind of functionality, and the two would not
>>>>>>>> likely be all that complementary.  I realize that I am exposing my
>>>>>>>> ignorance here and could be way off.  Is there anyone who knows a
>>>>>>>> bit about both of these technologies who could speak to this in
>>>>>>>> broad strokes?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Philip
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>>
>


-- 
s
