spark-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: compare/contrast Spark with Cascading
Date Tue, 29 Oct 2013 00:32:51 GMT
no problem :) i am actually not familiar with what oscar has said on this.
can you share or point me to the conversation thread?

it is my opinion based on the little experimenting i have done. but i am
willing to be convinced otherwise.
one of the very first things i did when we started using spark was to run jobs
with DISK_ONLY, to see if it could do some of the jobs that map-reduce does
for us. however i ran into OOMs, presumably because spark makes assumptions
that some things should fit in memory. i have to admit i didn't try too
hard after the first OOMs.
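
for what it's worth, here is a minimal sketch of the kind of DISK_ONLY run i am
talking about -- not our actual job, and the path and key parsing are made-up
placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    object DiskOnlyExperiment {
      def main(args: Array[String]): Unit = {
        // local master url just for the sketch; in practice this runs on a cluster
        val sc = new SparkContext("local[4]", "disk-only-experiment")

        // read a large dataset from hdfs and keep the blocks on disk only,
        // instead of in the jvm heap (path is a placeholder)
        val records = sc.textFile("hdfs:///data/events")
          .persist(StorageLevel.DISK_ONLY)

        // a shuffle-heavy aggregation; the shuffle buffers are still built
        // in memory, which is where OOMs can show up despite DISK_ONLY
        val counts = records
          .map(line => (line.split("\t")(0), 1L))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///out/event-counts")
        sc.stop()
      }
    }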

if spark were able to scale from the quick in-memory query to the overnight
disk-only giant batch query, i would love it! spark has a much nicer api
than map-red, and one could use a single set of algos for everything from
quick/realtime queries to giant batch jobs. as far as i am concerned
map-red would be done. our clusters of the future would be hdfs + spark.
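
to make "a single set of algos" concrete, a hedged sketch: the same pipeline,
with only the storage level switched between a quick in-memory query and a
disk-only batch run (names and paths are invented for illustration):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    object UserCounts {
      // the algorithm itself does not care where the data is cached
      def topUsers(lines: RDD[String], level: StorageLevel): Array[(String, Long)] = {
        lines.persist(level)
          .map(line => (line.split(",")(0), 1L))   // user id assumed to be the first field
          .reduceByKey(_ + _)
          .sortBy(pair => pair._2, ascending = false)
          .take(10)
      }
    }

    // quick in-memory query:  UserCounts.topUsers(sc.textFile(smallPath), StorageLevel.MEMORY_ONLY)
    // overnight batch run:    UserCounts.topUsers(sc.textFile(hugePath), StorageLevel.DISK_ONLY)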


On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra <mark@clearstorydata.com> wrote:

> And I didn't mean to skip over you, Koert.  I'm just more familiar with
> what Oscar said on the subject than with your opinion.
>
>
>
> On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
>
>>> Hmmm... I was unaware of this concept that Spark is for medium to large
>>> datasets but not for very large datasets.
>>
>>
>> It is the opinion of some at Twitter.  That doesn't make it true or a
>> universally held opinion.
>>
>>
>>
>> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole <arangole@gmail.com> wrote:
>>
>>> Hmmm... I was unaware of this concept that Spark is for medium to large
>>> datasets but not for very large datasets. What size is very large?
>>>
>>> Can someone please elaborate on why this would be the case and what
>>> stops Spark, as it is today, from being successfully run on very large
>>> datasets?  I'd appreciate it.
>>>
>>> I would think that Spark should be able to pull off Hadoop-level
>>> throughput in the worst case with DISK_ONLY caching.
>>>
>>> Thanks
>>> On Oct 28, 2013 1:37 PM, "Koert Kuipers" <koert@tresata.com> wrote:
>>>
>>>> i would say scalding (cascading + a DSL for scala) offers similar
>>>> functionality to spark, and a similar syntax.
>>>> the main difference between spark and scalding is the target jobs:
>>>> scalding is for long running jobs on very large data. the data is read
>>>> from and written to disk between steps. jobs run from minutes to days.
>>>> spark is for faster jobs on medium to large data. the data is primarily
>>>> held in memory. jobs run from a few seconds to a few hours. although spark
>>>> can work with data on disk, it still assumes that data needs to fit in
>>>> memory for certain steps (although less and less with every release).
>>>> spark also makes iterative designs much easier.
>>>>
>>>> i have found them both great to program in and complementary. we use
>>>> scalding for overnight batch processes and spark for more realtime
>>>> processes. at this point i would trust scalding a lot more due to the
>>>> robustness of the stack, but spark is getting better every day.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan <ceteri@gmail.com> wrote:
>>>>
>>>>> Hi Philip,
>>>>>
>>>>> Cascading is relatively agnostic about the distributed topology
>>>>> underneath it, especially as of the 2.0 release over a year ago. There's
>>>>> been some discussion about writing a flow planner for Spark -- e.g., which
>>>>> would replace the Hadoop flow planner. Not sure if there's active work on
>>>>> that yet.
>>>>>
>>>>> There are a few commercial workflow abstraction layers (probably what
>>>>> was meant by "application layer"?), in terms of the Cascading family
>>>>> (incl. Cascalog, Scalding), and also Actian's integration of
>>>>> Hadoop/Knime/etc., and also the work by Continuum, ODG, and others in the
>>>>> Py data stack.
>>>>>
>>>>> Spark would not be at the same level of abstraction as Cascading
>>>>> (business logic, effectively); however, something like MLbase is ostensibly
>>>>> intended for that: http://www.mlbase.org/
>>>>>
>>>>> With respect to Spark, two other things to watch... One would
>>>>> definitely be the Py data stack and the ability to integrate with PySpark,
>>>>> which is turning out to be a very powerful abstraction -- quite close to a
>>>>> large segment of industry needs.  The other project to watch, on the Scala
>>>>> side, is Summingbird and its evolution at Twitter:
>>>>> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>>>>>
>>>>> Paco
>>>>> http://amazon.com/dp/1449358721/
>>>>>
>>>>>
>>>>> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren <
>>>>> philip.ogren@oracle.com> wrote:
>>>>>
>>>>>>
>>>>>> My team is investigating a number of technologies in the Big Data
>>>>>> space.  A team member recently got turned on to Cascading
>>>>>> <http://www.cascading.org/about-cascading/> as an application layer for
>>>>>> orchestrating complex workflows/scenarios.  He asked me if Spark had an
>>>>>> "application layer"?  My initial reaction is "no", that Spark would not
>>>>>> have a separate orchestration/application layer.  Instead, the core Spark
>>>>>> API (along with Streaming) would compete directly with Cascading for this
>>>>>> kind of functionality and that the two would not likely be all that
>>>>>> complementary.  I realize that I am exposing my ignorance here and could
>>>>>> be way off.  Is there anyone who knows a bit about both of these
>>>>>> technologies who could speak to this in broad strokes?
>>>>>>
>>>>>> Thanks!
>>>>>> Philip
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>
