flink-user mailing list archives

From Hao Sun <ha...@zendesk.com>
Subject Re: java.lang.Exception: TaskManager was lost/killed
Date Mon, 09 Apr 2018 20:47:34 GMT
Same story here, 1.3.2 on K8s. It is very hard to find out why a TM is
killed. It is not likely caused by a memory leak. If there is a logger I
should turn on, please let me know.
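
For the codes mentioned in the quoted message (SIGTERM 15 and -100), here is a
rough sketch of how they can be decoded. It assumes the negative values are
YARN ContainerExitStatus constants and that positive values above 128 follow
the usual shell convention of 128 + signal number. The function name and the
wording of the descriptions are made up for illustration.

```python
# Sketch only: maps exit codes seen in container logs to a likely meaning.
# Negative values are assumed to be YARN ContainerExitStatus constants.
YARN_EXIT_STATUS = {
    -100: "ABORTED (container released by the framework or lost with its node)",
    -101: "DISKS_FAILED",
    -102: "PREEMPTED",
    -103: "KILLED_EXCEEDED_VMEM",
    -104: "KILLED_EXCEEDED_PMEM",
}

# Only the signals relevant to this thread; extend as needed.
SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM"}


def describe_exit(code: int) -> str:
    """Return a human-readable guess at what an exit code means."""
    if code in YARN_EXIT_STATUS:
        return "YARN: " + YARN_EXIT_STATUS[code]
    if code > 128:
        # Shell convention: exit code = 128 + number of the terminating signal.
        signum = code - 128
        return "killed by " + SIGNAL_NAMES.get(signum, f"signal {signum}")
    if code == 0:
        return "clean exit"
    return f"process exit code {code}"


if __name__ == "__main__":
    for code in (143, 137, -100):
        print(code, "->", describe_exit(code))
```

If -100 shows up, the node itself (lost or unhealthy host) is probably worth
checking before the TaskManager JVM.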

On Mon, Apr 9, 2018, 13:41 Lasse Nedergaard <lassenedergaard@gmail.com>
wrote:

> We see the same running 1.4.2 on YARN hosted on an AWS EMR cluster. The only
> thing I can find in the logs is SIGTERM with code 15 or -100.
> Today our simple job reading from Kinesis and writing to Cassandra was
> killed. The other day, in another job, I identified a map state.remove
> call that caused a lost task manager without an exception.
> I find it frustrating that it is so hard to find the root cause.
> If I look at historical metrics for CPU, heap and non-heap, I can't see
> anything that should cause a problem.
> So any ideas about how to debug this kind of exception are much
> appreciated.
>
> Med venlig hilsen / Best regards
> Lasse Nedergaard
>
>
> On 9 Apr 2018 at 21:48, Chesnay Schepler <chesnay@apache.org> wrote:
>
> We will need more information to offer any solution. The exception simply
> means that a TaskManager shut down, for which there are a myriad of
> possible explanations.
>
> Please have a look at the TaskManager logs, they may contain a hint as to
> why it shut down.
>
> On 09.04.2018 16:01, Javier Lopez wrote:
>
> Hi,
>
> "are you moving the job jar to the ~/flink-1.4.2/lib path?" -> Yes,
> to every node in the cluster.
>
> On 9 April 2018 at 15:37, miki haiat <miko5054@gmail.com> wrote:
>
>> Javier
>> "adding the jar file to the /lib path of every task manager"
>> are you moving the job jar to the ~/flink-1.4.2/lib path?
>>
>> On Mon, Apr 9, 2018 at 12:23 PM, Javier Lopez <javier.lopez@zalando.de>
>> wrote:
>>
>>> Hi,
>>>
>>> We had the same metaspace problem. It was solved by adding the jar file
>>> to the /lib path of every task manager, as explained here:
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html#avoiding-dynamic-classloading
>>> We also added these Java options: "-XX:CompressedClassSpaceSize=100M
>>> -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M"
>>>
>>> From time to time we have the same problem with TaskManagers
>>> disconnecting, but the logs are not useful. We are using 1.3.2.
>>>
>>> On 9 April 2018 at 10:41, Alexander Smirnov <
>>> alexander.smirnoff@gmail.com> wrote:
>>>
>>>> I've seen a similar problem, but the cause was not the heap size, it was
>>>> Metaspace. It was caused by a job restarting in a loop. It looks like for
>>>> each restart, Flink loads new instances of the classes and very soon it
>>>> runs out of metaspace.
>>>>
>>>> I've created a JIRA issue for this problem, but got no response from
>>>> the development team on it:
>>>> https://issues.apache.org/jira/browse/FLINK-9132
>>>>
>>>>
>>>> On Mon, Apr 9, 2018 at 11:36 AM 王凯 <wangkaibg@163.com> wrote:
>>>>
>>>>> Thanks a lot, I will try it.
>>>>>
>>>>> On 2018-04-09 00:06:02, "TechnoMage" <mlatta@technomage.com> wrote:
>>>>>
>>>>> I have seen this when my task manager ran out of RAM.  Increase the
>>>>> heap size.
>>>>>
>>>>> flink-conf.yaml:
>>>>> taskmanager.heap.mb
>>>>> jobmanager.heap.mb
>>>>>
>>>>> Michael
>>>>>
>>>>> On Apr 8, 2018, at 2:36 AM, 王凯 <wangkaibg@163.com> wrote:
>>>>>
>>>>> <QQ图片20180408163927.png>
>>>>> Hi all, I recently ran into a problem: the job runs well at first, but
>>>>> after a long run the exception shown above appears. How can I resolve it?
>>>>>
>
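
To pull together the settings mentioned in this thread (the two heap keys and
the metaspace JVM options), here is a small sketch that renders them as
flink-conf.yaml lines. The heap values are placeholders, not recommendations,
and passing the JVM options through the env.java.opts key is an assumption,
since the thread does not say how they were set. The helper name is made up.

```python
# Sketch only: renders the settings discussed in the thread as
# flink-conf.yaml lines ("key: value"). Values marked as placeholders
# are illustrative and must be sized for the actual cluster.
def render_flink_conf(settings: dict) -> str:
    """Format a dict of settings as 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


settings = {
    "taskmanager.heap.mb": 4096,  # placeholder value
    "jobmanager.heap.mb": 1024,   # placeholder value
    # Assumption: JVM options are passed via env.java.opts.
    "env.java.opts": (
        "-XX:CompressedClassSpaceSize=100M "
        "-XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M"
    ),
}

if __name__ == "__main__":
    print(render_flink_conf(settings))
```

The output can be compared against the existing flink-conf.yaml rather than
pasted in blindly, since some of these keys may already be set there.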
