flink-issues mailing list archives

From "Scott Sue (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-10988) Improve debugging / visibility of job state
Date Tue, 05 Mar 2019 03:41:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784029#comment-16784029 ]

Scott Sue commented on FLINK-10988:

This is something that we can do in our own code.  However, that is a sledgehammer approach:
we would have to wrap every Flink Operator to ensure that it doesn't fail unexpectedly and
ultimately kill the job itself.  I would have thought that most users would want this kind of
support to help trace any issues within their jobs.
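
To illustrate what that wrapping amounts to, here is a minimal sketch in plain Java with no
Flink dependency.  GuardedFunction is a hypothetical helper, not a Flink class; it assumes the
per-element logic can be expressed as a java.util.function.Function, and that swallowing the
failure is acceptable for the job in question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical helper (not part of Flink): guards a per-element function. */
final class GuardedFunction<I, O> {
    private final Function<I, O> delegate;
    // Records one line per failed element so the input can be inspected later.
    private final List<String> failures = new ArrayList<>();

    GuardedFunction(Function<I, O> delegate) {
        this.delegate = delegate;
    }

    /** Applies the delegate; on failure records the input and returns empty instead of rethrowing. */
    Optional<O> apply(I input) {
        try {
            return Optional.ofNullable(delegate.apply(input));
        } catch (RuntimeException e) {
            failures.add("input=" + input + " error=" + e);
            return Optional.empty();
        }
    }

    List<String> failures() {
        return failures;
    }
}
```

Every user function in the job would need the same treatment (for example
new GuardedFunction<>(String::length) around a simple mapping), which is why doing this by hand
does not scale.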

Even if the job still stopped due to an exception, it would be nice to have some extra information
printed in the logs about what it was attempting to do, as opposed to just a stack trace.
In my experience with Flink, it's quite hard to track down exactly what the state of the Operator
was, along with the event that it was processing at the time, in order to trace the root cause of
the issue.  It would be nice to have some out-of-the-box tools to get to this information more quickly.
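
As a rough sketch of the kind of log output I mean, the failure could be rethrown with the
event and a snapshot of the state attached.  This is plain Java, not a Flink API:
ProcessingContextException and ContextualProcessor are hypothetical names, and the sketch
assumes the state and event have a useful toString() and are safe to write to the logs.

```java
import java.util.function.BiFunction;

/** Hypothetical exception type (not a Flink class): carries the event and a state snapshot. */
final class ProcessingContextException extends RuntimeException {
    ProcessingContextException(Object event, Object state, Throwable cause) {
        super("Failed processing event=" + event + " with state=" + state, cause);
    }
}

/** Hypothetical helper: runs the logic and adds context to any runtime failure. */
final class ContextualProcessor {
    private ContextualProcessor() {}

    static <S, E, R> R process(S state, E event, BiFunction<S, E, R> logic) {
        try {
            return logic.apply(state, event);
        } catch (RuntimeException e) {
            // Keep the original exception as the cause so the stack trace is preserved.
            throw new ProcessingContextException(event, state, e);
        }
    }
}
```

The job still fails, but the exception message now shows the event and the state alongside the
original stack trace.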

> Improve debugging / visibility of job state
> -------------------------------------------
>                 Key: FLINK-10988
>                 URL: https://issues.apache.org/jira/browse/FLINK-10988
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Operators
>            Reporter: Scott Sue
>            Priority: Major
> When a Flink Job is running and encounters an unexpected exception, either through processing
an unexpected message, or a message that may be well formed but for which the state of the job
produces an exception, it can be difficult to diagnose the cause of the issue.  For example, I would
get an NPE in one of the Operators:
> 2018-11-13 10:10:26,332 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Co-Process-Broadcast-Keyed -> Map -> Map -> Sink: Unnamed (1/1) (9a8f3b970570742b7b174a01a9bb1405) switched from RUNNING to FAILED.
> java.lang.NullPointerException
>  at com.celertech.analytics.flink.topology.marketimpact.PriceUtils.findPriceForEntryType(PriceUtils.java:28)
>  at com.celertech.analytics.flink.topology.marketimpact.PriceUtils.getPriceForMarketDataEntryType(PriceUtils.java:18)
>  at com.celertech.analytics.flink.function.midrate.MidRateBroadcaster.processBroadcastElement(MidRateBroadcaster.java:77)
>  at com.celertech.analytics.flink.function.midrate.MidRateTagKeyedBroadcastProcessFunction.processBroadcastElement(MidRateTagKeyedBroadcastProcessFunction.java:36)
>  at com.celertech.analytics.flink.function.midrate.MidRateTagKeyedBroadcastProcessFunction.processBroadcastElement(MidRateTagKeyedBroadcastProcessFunction.java:12)
>  at org.apache.flink.streaming.api.operators.co.CoBroadcastWithKeyedOperator.processElement2(CoBroadcastWithKeyedOperator.java:121)
> An improvement to this would be to allow printing of the incoming message so that the
developer can diagnose whether that message was correct.  Printing the state of the job would
be nice as well, in case incorrect job state led to the exception.

This message was sent by Atlassian JIRA
