spark-user mailing list archives

From Adam Bordelon <a...@mesosphere.io>
Subject Re: Spark (Streaming?) holding on to Mesos Resources
Date Tue, 27 Jan 2015 08:23:14 GMT
> Hopefully some very bad ugly bug that has been fixed already and that
> will urge us to upgrade our infra?
> Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0

Could be https://issues.apache.org/jira/browse/MESOS-1688 (fixed in Mesos
0.21)
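For context, `spark.cores.max` and `spark.executor.memory` (mentioned in the quoted message below) are standard per-application Spark properties that cap the total cores a framework will accept from Mesos offers and the memory per executor. A minimal sketch of how such a job might be submitted; the master URL, class name, and jar path are placeholders, not taken from the thread:

```shell
# Sketch of submitting a Spark job to a Mesos cluster with the limits
# discussed in this thread. ZooKeeper address, class, and jar are
# hypothetical placeholders.
spark-submit \
  --master mesos://zk://zk1:2181/mesos \
  --conf spark.cores.max=4 \
  --conf spark.executor.memory=3g \
  --class com.example.FooStreaming \
  foo-streaming.jar
```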

On Mon, Jan 26, 2015 at 2:45 PM, Gerard Maas <gerard.maas@gmail.com> wrote:

> Hi Jörn,
>
> A memory leak in the job would be contained within the resources reserved
> for it, wouldn't it?
> And the job holding resources is not always the same. Sometimes it's one
> of the Streaming jobs, sometimes it's a heavy batch job that runs every
> hour.
> It looks to me that whatever is causing the issue is participating in the
> resource-offer protocol of Mesos, and my first suspect would be the Mesos
> scheduler in Spark. (The table above is from the "Offers" tab in the Mesos
> UI.)
>
> Are there any other factors involved in the offer acceptance/rejection
> between Mesos and a scheduler?
>
> What do you think?
>
> -kr, Gerard.
>
> On Mon, Jan 26, 2015 at 11:23 PM, Jörn Franke <jornfranke@gmail.com>
> wrote:
>
>> Hi,
>>
>> What do your jobs do? Ideally post the source code, but some description
>> would already be helpful in supporting you.
>>
>> Memory leaks can have several reasons - it may not be Spark at all.
>>
>> Thank you.
>>
>> On 26 Jan 2015 at 22:28, "Gerard Maas" <gerard.maas@gmail.com> wrote:
>>
>> >
>> > (Looks like the list didn't like an HTML table in the previous email.
>> > My apologies for any duplicates.)
>> >
>> > Hi,
>> >
>> > We are observing with some regularity that our Spark jobs, running as
>> > Mesos frameworks, hoard resources and do not release them, resulting in
>> > resource starvation for all jobs running on the Mesos cluster.
>> >
>> > For example:
>> > This is a job that has spark.cores.max = 4 and
>> > spark.executor.memory = "3g"
>> >
>> > | ID                  | Framework    | Host                 | CPUs | Mem     |
>> > | …5050-16506-1146497 | FooStreaming | dnode-4.hdfs.private | 7    | 13.4 GB |
>> > | …5050-16506-1146495 | FooStreaming | dnode-0.hdfs.private | 1    | 6.4 GB  |
>> > | …5050-16506-1146491 | FooStreaming | dnode-5.hdfs.private | 7    | 11.9 GB |
>> > | …5050-16506-1146449 | FooStreaming | dnode-3.hdfs.private | 7    | 4.9 GB  |
>> > | …5050-16506-1146247 | FooStreaming | dnode-1.hdfs.private | 0.5  | 5.9 GB  |
>> > | …5050-16506-1146226 | FooStreaming | dnode-2.hdfs.private | 3    | 7.9 GB  |
>> > | …5050-16506-1144069 | FooStreaming | dnode-3.hdfs.private | 1    | 8.7 GB  |
>> > | …5050-16506-1133091 | FooStreaming | dnode-5.hdfs.private | 1    | 1.7 GB  |
>> > | …5050-16506-1133090 | FooStreaming | dnode-2.hdfs.private | 5    | 5.2 GB  |
>> > | …5050-16506-1133089 | FooStreaming | dnode-1.hdfs.private | 6.5  | 6.3 GB  |
>> > | …5050-16506-1133088 | FooStreaming | dnode-4.hdfs.private | 1    | 251 MB  |
>> > | …5050-16506-1133087 | FooStreaming | dnode-0.hdfs.private | 6.4  | 6.8 GB  |
>> >
>> > The only way to release the resources is to manually find the process
>> > in the cluster and kill it. The affected jobs are often streaming jobs,
>> > but batch jobs show this behavior as well; we run more streaming jobs
>> > than batch, so the stats are biased.
>> > Any ideas of what's going on here? Hopefully it's some very bad, ugly
>> > bug that has already been fixed and that will urge us to upgrade our
>> > infra?
>> >
>> > Mesos 0.20 +  Marathon 0.7.4 + Spark 1.1.0
>> >
>> > -kr, Gerard.
>>
>>
>
