spark-user mailing list archives

From Hien Luu <h...@linkedin.com.INVALID>
Subject Re: Spark job workflow engine recommendations
Date Fri, 07 Aug 2015 18:23:18 GMT
Scalability is a known issue due to the current architecture.  However,
this only becomes relevant if you run more than 20K jobs per day.

On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> From what I heard (from an ex-coworker who is an Oozie committer), Azkaban
> is being phased out at LinkedIn because of scalability issues (though UI-wise,
> Azkaban seems better).
>
> Vikram:
> I suggest you do more research into related projects (maybe using their
> mailing lists).
>
> Disclaimer: I don't work for LinkedIn.
>
> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <nick.pentreath@gmail.com>
> wrote:
>
>> Hi Vikram,
>>
>> We use Azkaban (2.5.0) for our production workflow scheduling. We just use
>> local-mode deployment, and it is fairly easy to set up. It is pretty easy to
>> use and has a nice scheduling and logging interface, as well as SLAs (like
>> killing a job and notifying if it doesn't complete in 3 hours or whatever).
>>
>> However, Spark support is not present directly - we run everything with
>> shell scripts and spark-submit. There is a plugin interface where one could
>> create a Spark plugin, but when I investigated I found it very cumbersome
>> and didn't have the time to work through developing one.
>>
>> It has some quirks, and while there is actually a REST API for adding jobs
>> and dynamically scheduling jobs, it is not documented anywhere, so you kind
>> of have to figure it out for yourself. But in terms of ease of use I found it
>> way better than Oozie. I haven't tried Chronos, which seemed quite
>> involved to set up. Haven't tried Luigi either.
>>
>> Spark Job Server is good but, as you say, lacks some features like scheduling
>> and DAG-type workflows (independent of Spark-defined job flows).
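[For reference, the shell-script-plus-spark-submit approach described above maps to an Azkaban "command"-type job file; a minimal sketch, where the job name, script path, and spark-submit arguments are all placeholders:]

```
# spark-etl.job -- Azkaban "command" job (all names here are hypothetical)
type=command
command=sh run_spark_etl.sh
# run_spark_etl.sh would contain something like:
#   spark-submit --master spark://master:7077 \
#     --class com.example.EtlJob etl-assembly.jar
```

[Job dependencies can then be declared with a `dependencies=` property in downstream job files, which is how Azkaban builds its DAG.]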
>>
>>
>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>>
>>> Check also Falcon in combination with Oozie.
>>>
>>> On Fri, Aug 7, 2015 at 17:51, Hien Luu <hluu@linkedin.com.invalid>
>>> wrote:
>>>
>>>> Looks like Oozie can satisfy most of your requirements.
>>>>
>>>>
>>>>
>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone <vikramkone@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> I'm looking for open source workflow tools/engines that allow us to
>>>>> schedule Spark jobs on a DataStax Cassandra cluster. Since there are
>>>>> tonnes of alternatives out there like Oozie, Azkaban, Luigi, Chronos,
>>>>> etc., I wanted to check with people here to see what they are using today.
>>>>>
>>>>> Some of the requirements of the workflow engine that I'm looking for
>>>>> are
>>>>>
>>>>> 1. First-class support for submitting Spark jobs on Cassandra. Not
>>>>> some wrapper Java code to submit tasks.
>>>>> 2. Active open source community support and well tested at production
>>>>> scale.
>>>>> 3. Should be dead easy to write job dependencies using XML or a web
>>>>> interface. E.g., job A depends on Job B and Job C, so run Job A after
>>>>> B and C are finished. We shouldn't need to write full-blown Java
>>>>> applications to specify job parameters and dependencies. Should be
>>>>> very simple to use.
>>>>> 4. Time-based recurrent scheduling. Run the Spark jobs at a given
>>>>> time every hour, day, week, or month.
>>>>> 5. Job monitoring, alerting on failures, and email notifications on a
>>>>> daily basis.
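[Requirement 3 above (run job A only after jobs B and C finish) is expressed in Oozie as a fork/join in the workflow XML; a minimal sketch, with action bodies elided and all names hypothetical:]

```xml
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="fork-bc"/>
  <fork name="fork-bc">
    <path start="job-b"/>
    <path start="job-c"/>
  </fork>
  <action name="job-b">
    <!-- shell/spark-submit action body elided -->
    <ok to="join-bc"/>
    <error to="fail"/>
  </action>
  <action name="job-c">
    <!-- shell/spark-submit action body elided -->
    <ok to="join-bc"/>
    <error to="fail"/>
  </action>
  <join name="join-bc" to="job-a"/>
  <action name="job-a">
    <!-- runs only after both job-b and job-c succeed -->
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```

[Time-based recurrence (requirement 4) would be layered on top via an Oozie coordinator definition.]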
>>>>>
>>>>> I have looked at Ooyala's Spark Job Server, which seems to be geared
>>>>> towards making Spark jobs run faster by sharing contexts between the
>>>>> jobs, but isn't a full-blown workflow engine per se. A combination of
>>>>> Spark Job Server and a workflow engine would be ideal.
>>>>>
>>>>> Thanks for the inputs
>>>>>
>>>>
>>>>
>>
>
