spark-dev mailing list archives

From Michael Gummelt <mgumm...@mesosphere.io>
Subject Re: Mesos checkpointing
Date Wed, 24 May 2017 18:27:08 GMT
Ah, then yea, checkpointing should solve your problem.  Let's do that.
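[For context, a rough sketch of what enabling this might look like from the submit side. The property names below are hypothetical, since the point of this thread is that Spark did not yet expose per-framework checkpoint/failover settings:]

```shell
# Hypothetical invocation -- spark.mesos.checkpoint is an illustrative
# name, not a documented property at the time of this thread.
spark-submit \
  --master mesos://zk://master:2181/mesos \
  --conf spark.mesos.checkpoint=true \
  --conf spark.mesos.failoverTimeout=3600 \
  my_job.jar
```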

On Wed, May 24, 2017 at 11:19 AM, Charles Allen <
charles.allen@metamarkets.com> wrote:

> The issue on our side is we tend to roll out a bunch of agent updates at
> about the same time. So rolling an agent, then waiting for spark jobs to
> recover, then rolling another agent is not at all practical. It is a huge
> benefit if we can just update the agents in bulk (or even sequentially, but
> only waiting for the mesos agent to recover).
>
> On Wed, May 24, 2017 at 11:17 AM Michael Gummelt <mgummelt@mesosphere.io>
> wrote:
>
>> > We had investigated internally recently why restarting the mesos
>> agents failed the spark jobs (no real reason they should, right?) and came
>> across the data.
>>
>> Restarting the agent without checkpointing enabled will kill the
>> executor, but that still shouldn't cause the Spark job to fail, since Spark
>> jobs should tolerate executor failures.
>>
>> On Mon, Apr 3, 2017 at 2:26 PM, Timothy Chen <tnachen@gmail.com> wrote:
>>
>>> Yes, adding the timeout config should be the only code change required.
>>>
>>> And just to clarify, this is for reconnecting with Mesos master (not
>>> agents) after failover.
>>>
>>> Tim
>>>
>>> On Mon, Apr 3, 2017 at 2:23 PM, Charles Allen
>>> <charles.allen@metamarkets.com> wrote:
>>> > We recently investigated internally why restarting the Mesos agents
>>> > failed the Spark jobs (no real reason they should, right?) and came
>>> > across the data. The other conversation started by Yu prompted us to
>>> > poke at getting some of the tickets updated, to spread around any
>>> > tribal knowledge floating in the community.
>>> >
>>> > It sounds like the only thing keeping it from being enabled is a
>>> > timeout config and someone volunteering to do some testing?
>>> >
>>> >
>>> > On Mon, Apr 3, 2017 at 2:19 PM Timothy Chen <tnachen@gmail.com> wrote:
>>> >>
>>> >> The only reason is that MesosClusterScheduler by design is long
>>> >> running so we really needed it to have failover configured correctly.
>>> >>
>>> >> I wanted to create a JIRA ticket to allow users to configure it for
>>> >> each Spark framework, but just didn't remember to do so.
>>> >>
>>> >> Per another question that came up on the mailing list, I believe we
>>> >> should add it, as it's a fairly straightforward effort.
>>> >>
>>> >> Tim
>>> >>
>>> >> On Mon, Apr 3, 2017 at 2:16 PM, Charles Allen
>>> >> <charles.allen@metamarkets.com> wrote:
>>> >> > As per https://issues.apache.org/jira/browse/SPARK-4899
>>> >> >
>>> >> > org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils#createSchedulerDriver
>>> >> > allows checkpointing, but only
>>> >> > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler
>>> >> > uses it. Is there a reason for that?
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>>
>
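[Tim's note above — that failover_timeout governs reconnecting with the Mesos master, while checkpoint governs surviving agent restarts — can be sketched at the Mesos API level. This is a sketch only, assuming the standard Mesos Java bindings; the exact wiring inside Spark's createSchedulerDriver is what the thread discusses exposing:]

```scala
// Sketch only: requires the Mesos Java bindings (org.apache.mesos) on
// the classpath; not the actual Spark code path.
import org.apache.mesos.Protos.FrameworkInfo

val frameworkInfo = FrameworkInfo.newBuilder()
  .setUser("spark")
  .setName("My Spark Framework")
  // Ask agents to checkpoint task/executor state, so executors can
  // survive an agent restart and reconnect on recovery.
  .setCheckpoint(true)
  // How long (in seconds) the master keeps the framework's tasks
  // running while the scheduler reconnects after a master failover.
  .setFailoverTimeout(3600.0)
  .build()
```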


-- 
Michael Gummelt
Software Engineer
Mesosphere
