spark-user mailing list archives

From Lars Albertsson <la...@mapflat.com>
Subject Re: Spark Job trigger in production
Date Thu, 21 Jul 2016 09:52:43 GMT
I assume that you would like to trigger Spark batch jobs, and not
streaming jobs.

For production jobs, I recommend against scheduling batch jobs
directly with cron or cron-like services such as Chronos. Jobs will
sometimes fail, either due to missing input data or due to execution
problems. When that happens, you will need a mechanism to backfill
missing datasets by retrying jobs; otherwise your system will be brittle.

The component that does this for you is called a workflow manager. I
suggest using either Luigi (https://github.com/spotify/luigi) or
Airflow (https://github.com/apache/incubator-airflow). You will need
to periodically schedule the workflow manager to evaluate your
pipeline status and run jobs (at least with Luigi), but the workflow
manager verifies input data presence before starting jobs, and can
cover up for transient failures and delayed input data, making the
system as a whole stable.
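The core behaviour a workflow manager adds over plain cron can be
sketched roughly as below. This is a simplified, hypothetical
illustration in Python, not Luigi's or Airflow's actual API: the job
runs only once its input dataset exists, completed work is skipped on
re-runs (which is what makes backfills safe), and transient failures
are retried instead of silently dropped.

```python
import os
import time

def run_when_ready(input_path, output_path, job, retries=3, poll_seconds=60):
    """Toy stand-in for a workflow manager's scheduling loop.

    Checks input presence before starting, skips work whose output
    already exists, and retries transient failures.
    """
    if os.path.exists(output_path):
        return  # already produced by an earlier run; nothing to backfill
    while not os.path.exists(input_path):
        time.sleep(poll_seconds)  # input delayed; wait rather than fail
    for attempt in range(1, retries + 1):
        try:
            job(input_path, output_path)
            return
        except Exception:
            if attempt == retries:
                raise  # permanent failure; surface it for alerting
```

Because the check for existing output makes the run idempotent, you
can safely invoke it repeatedly from a periodic trigger, which is
essentially how Luigi is driven.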

Oozie, mentioned in this thread, is also a workflow manager, but it
uses an XML-based DSL. The syntax is clumsy and limited in
expressivity, which prevents you from using it for some complex but
common scenarios, e.g. pipelines that require dynamic dependencies.
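For comparison, a single Spark action in an Oozie workflow looks
roughly like the sketch below. Element names are from memory of the
4.x spark-action schema and may differ in detail; the class, jar, and
property names are placeholders. The point is the verbosity: one job
with two outcomes already takes this much XML.

```xml
<workflow-app name="spark-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>MySparkJob</name>
            <class>com.example.MyJob</class>
            <jar>${nameNode}/apps/myjob.jar</jar>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark job failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```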

Some frameworks for running services, e.g. Aurora and Kubernetes, are
also capable of executing batch jobs. Their DSLs for expressing
dependencies are weak, however, so they are suitable only for simple
pipelines. They are useful if you want to run Spark Streaming jobs.
Marathon did not support batch jobs last I checked, and is only useful
for streaming scenarios.

You can find more context and advice on running batch jobs in
production from the resources in this list, under the sections "End to
end" and "Batch processing":
http://www.mapflat.com/lands/resources/reading-list/

Regards,


Lars Albertsson
Data engineering consultant
www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109
Calendar: https://goo.gl/6FBtlS



On Wed, Jul 20, 2016 at 3:47 PM, Sathish Kumaran Vairavelu
<vsathishkumaran@gmail.com> wrote:
> If you are using Mesos, then you can use Chronos or Marathon
>
> On Wed, Jul 20, 2016 at 6:08 AM Rabin Banerjee
> <dev.rabin.banerjee@gmail.com> wrote:
>>
>> ++ crontab :)
>>
>> On Wed, Jul 20, 2016 at 9:07 AM, Andrew Ehrlich <andrew@aehrlich.com>
>> wrote:
>>>
>>> Another option is Oozie with the spark action:
>>> https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html
>>>
>>> On Jul 18, 2016, at 12:15 AM, Jagat Singh <jagatsingh@gmail.com> wrote:
>>>
>>> You can use following options
>>>
>>> * spark-submit from shell
>>> * some kind of job server; see spark-jobserver for details
>>> * some notebook environment; see Zeppelin for an example
>>>
>>> On 18 July 2016 at 17:13, manish jaiswal <manishsrm14@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>> What is the best approach to trigger spark job in production cluster?
>>>
>>>
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

