spark-user mailing list archives

From Hien Luu <>
Subject Re: Spark job workflow engine recommendations
Date Tue, 11 Aug 2015 17:30:13 GMT
We are in the middle of figuring that out.  At a high level, we want to
combine the best parts of existing workflow solutions.

On Fri, Aug 7, 2015 at 3:55 PM, Vikram Kone <> wrote:

> Hien,
> Is Azkaban being phased out at LinkedIn as rumored? If so, what's LinkedIn
> going to use for workflow scheduling? Is there something else that's going
> to replace Azkaban?
> On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu <> wrote:
>> In my opinion, choosing some particular project among its peers should
>> leave enough room for future growth (which may come faster than you
>> initially think).
>> Cheers
>> On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu <> wrote:
>>> Scalability is a known issue due to the current architecture.  However,
>>> this only applies if you run more than 20K jobs per day.
>>> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu <> wrote:
>>>> From what I heard (from an ex-coworker who is an Oozie committer), Azkaban
>>>> is being phased out at LinkedIn because of scalability issues (though
>>>> UI-wise, Azkaban seems better).
>>>> Vikram:
>>>> I suggest you do more research in related projects (maybe using their
>>>> mailing lists).
>>>> Disclaimer: I don't work for LinkedIn.
>>>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <> wrote:
>>>>> Hi Vikram,
>>>>> We use Azkaban (2.5.0) in our production workflow scheduling. We just
>>>>> use local mode deployment and it is fairly easy to set up. It is pretty
>>>>> easy to use and has a nice scheduling and logging interface, as well as
>>>>> SLAs (like kill job and notify if it doesn't complete in 3 hours or
>>>>> whatever).
>>>>> However, Spark support is not present directly - we run everything with
>>>>> shell scripts and spark-submit. There is a plugin interface where one
>>>>> could create a Spark plugin, but I found it very cumbersome when I did
>>>>> investigate and didn't have the time to work through it to develop that.
>>>>> It has some quirks, and while there is actually a REST API for adding
>>>>> jobs and dynamically scheduling jobs, it is not documented anywhere, so
>>>>> you kind of have to figure it out for yourself. But in terms of ease of
>>>>> use I found it way better than Oozie. I haven't tried Chronos, and it
>>>>> seemed quite involved to set up. Haven't tried Luigi either.
>>>>> Spark job server is good but as you say lacks some stuff like
>>>>> scheduling and DAG type workflows (independent of spark-defined job flows).
>>>>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke <> wrote:
>>>>>> Also check out Falcon in combination with Oozie
>>>>>> On Fri, Aug 7, 2015 at 5:51 PM, Hien Luu <> wrote:
>>>>>>> Looks like Oozie can satisfy most of your requirements.
>>>>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone <>
>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>> I'm looking for open source workflow tools/engines that allow us to
>>>>>>>> schedule Spark jobs on a DataStax Cassandra cluster. Since there are
>>>>>>>> tonnes of alternatives out there like Oozie, Azkaban, Luigi, Chronos
>>>>>>>> etc., I wanted to check with people here to see what they are using.
>>>>>>>> Some of the requirements of the workflow engine that I'm looking
>>>>>>>> for are
>>>>>>>> 1. First class support for submitting Spark jobs on Cassandra. Not
>>>>>>>> some wrapper Java code to submit tasks.
>>>>>>>> 2. Active open source community support and well tested at
>>>>>>>> production scale.
>>>>>>>> 3. Should be dead easy to write job dependencies using XML or a web
>>>>>>>> interface. Ex: job A depends on job B and job C, so run job A after
>>>>>>>> B and C are finished. Don't need to write full blown Java
>>>>>>>> applications to specify job parameters and dependencies. Should be
>>>>>>>> very simple to
>>>>>>>> 4. Time based recurrent scheduling. Run the Spark jobs at a given
>>>>>>>> time every hour or day or week or month.
>>>>>>>> 5. Job monitoring, alerting on failures and email notifications on a
>>>>>>>> daily basis.
>>>>>>>> I have looked at Ooyala's spark job server which seems to be geared
>>>>>>>> towards making Spark jobs run faster by sharing contexts between the
>>>>>>>> jobs but isn't a full blown workflow engine per se. A combination of
>>>>>>>> spark job server and workflow engine would be ideal.
>>>>>>>> Thanks for the inputs
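
For illustration only, requirement 3 in the original question (run job A after
jobs B and C have finished) combined with the "shell scripts and spark-submit"
approach described in the thread could be sketched roughly as below. This is
not how Azkaban, Oozie, or any other engine mentioned here works internally.
The job names, main classes, jar name, and the helper functions run_order and
spark_submit_command are all hypothetical. The sketch only builds the command
lines and does not launch anything.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency graph: each job maps to the jobs it depends on.
dependencies = {
    "A": {"B", "C"},  # job A runs only after B and C are finished
    "B": set(),
    "C": set(),
}

# Hypothetical job definitions: (main class, application jar) per job.
jobs = {
    "A": ("com.example.JobA", "jobs.jar"),
    "B": ("com.example.JobB", "jobs.jar"),
    "C": ("com.example.JobC", "jobs.jar"),
}


def run_order(deps):
    """Return job names in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def spark_submit_command(main_class, jar, master="local[*]"):
    """Build (but do not run) a spark-submit command line as a list of args."""
    return ["spark-submit", "--class", main_class, "--master", master, jar]


order = run_order(dependencies)
commands = [spark_submit_command(*jobs[name]) for name in order]

for cmd in commands:
    print(" ".join(cmd))
```

A real scheduler would additionally need the time-based triggering, failure
alerting, and monitoring listed in requirements 4 and 5, which this sketch
does not attempt.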
