spark-dev mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: more uniform exception handling?
Date Tue, 19 Apr 2016 09:45:03 GMT

On 18 Apr 2016, at 20:16, Reynold Xin <rxin@databricks.com> wrote:

Josh's pull request <https://github.com/apache/spark/pull/12433> on RPC exception handling
got me thinking ...

In my experience, a few exception-related things have created a lot of trouble for us in
production debugging:

1. An exception is thrown, but is caught by some try/catch that does no logging and does
not rethrow.
2. An exception is thrown, and is caught by some try/catch that does no logging but does
rethrow. The original exception is now masked (see the sketch below).
3. Multiple exceptions are logged at different places close to each other, but we don't know
whether they are caused by the same problem or not.
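
To make these concrete, here's a minimal illustration of the first two failure modes (the
names are made up):

    object ExceptionPitfalls {
      // Mode 1: swallowed -- caught, not logged, not rethrown.
      def swallowed(): Unit =
        try riskyCall() catch {
          case _: Exception => // the failure leaves no trace at all
        }

      // Mode 2: masked -- rethrown without the original as the cause,
      // so the real stack trace is lost.
      def masked(): Unit =
        try riskyCall() catch {
          case _: Exception =>
            throw new RuntimeException("operation failed") // cause not set
        }

      private def riskyCall(): Unit =
        throw new IllegalStateException("the real problem")
    }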


To mitigate some of the above, here's an idea ...

(1) Create a common root class for all the exceptions (e.g. call it SparkException) used in
Spark. We should make sure that every time we catch an exception from a 3rd-party library, we
rethrow it as a SparkException (a lot of places already do that). In SparkException's
constructor, log the exception and the stacktrace.

(2) SparkException has a monotonically increasing ID, and this ID appears in the exception
error message (say at the end).
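
A rough sketch of what (1) and (2) could look like together (a per-JVM counter; this is
illustrative, not a concrete proposal for the class):

    import java.util.concurrent.atomic.AtomicLong
    import org.slf4j.LoggerFactory

    object SparkException {
      private val counter = new AtomicLong(0L)
      private val log = LoggerFactory.getLogger(classOf[SparkException])
    }

    class SparkException(message: String, cause: Throwable = null)
      extends Exception(message, cause) {

      // (2) A monotonically increasing ID. Note it is only unique within
      // one JVM; correlating across processes would need something more.
      val id: Long = SparkException.counter.incrementAndGet()

      // (1) Log at construction time, so even an exception that is later
      // swallowed still leaves a trace with its stacktrace.
      SparkException.log.error(s"SparkException #$id: $message", this)

      // The ID appears at the end of the error message, for correlation.
      override def getMessage: String = s"${super.getMessage} (id: $id)"
    }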


I think (1) will eliminate most of the cases in which an exception gets swallowed. The main
downside I can think of is that we might log an exception multiple times. However, I'd argue
exceptions should be rare, and it is not that big of a deal to log them two or three times.
The unique ID (2) can help us correlate exceptions if they appear multiple times.

Thoughts?






1. Unique IDs are a nice touch.
2. There are some exceptions that code really needs to match on, usually in the network
layer, plus InterruptedException. It's dangerous to swallow them.
3. I've done work on other projects (Slider, with YARN-679 to get them into Hadoop) where
exceptions can also declare an exit code. This means system exits can have different exit
codes for different problems, and the exception-raising code gets to choose the code (see
the sketch after this list). For extra fun, the set of exit codes attempts to lift numbers
from HTTP errors, so "41" is unauthed, from HTTP 401:
https://slider.incubator.apache.org/docs/exitcodes.html
4. Once you have different exit codes, you can start writing tests for the scripts designed
to trigger failures, asserting about the exit code as a way to assess the outcome.
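
Roughly like this (names invented here, not Slider's actual classes); a test harness can
then assert on the process exit status:

    // The raising code picks the exit code the process will exit with.
    class ExitCodeException(val exitCode: Int, message: String)
      extends RuntimeException(message)

    object ServiceLauncher {
      def main(args: Array[String]): Unit =
        try {
          run(args)
        } catch {
          case e: ExitCodeException =>
            System.err.println(e.getMessage)
            sys.exit(e.exitCode) // e.g. 41 "unauthed", lifted from HTTP 401
          case e: Exception =>
            e.printStackTrace()
            sys.exit(1)          // generic failure
        }

      private def run(args: Array[String]): Unit =
        throw new ExitCodeException(41, "Not authenticated")
    }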

Something else to consider is "what can be added atop the classic runtime exceptions to make
them useful?" Hadoop's NetUtils.wrapException() does this: it catches things coming up from
the network stack and rethrows an exception of the same type (where possible), but now with
source/dest hostnames and ports. That is incredibly useful. The exceptions also tack in wiki
references to what the exceptions mean, in a desperate attempt to reduce the number of JIRAs
complaining about services refusing connections. It's hard to tell how often that works; some
people do now just paste in the stack trace without reading the wiki link. At least now
there's somewhere to point them at when the issue is closed as invalid. [See:
http://steveloughran.blogspot.co.uk/2011/09/note-on-distributed-computing.html ]
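
The pattern looks roughly like this (an illustrative sketch, not Hadoop's actual
implementation):

    import java.net.ConnectException

    object NetExceptionWrapping {
      // Rethrow the same type, enriched with the endpoint and a pointer
      // to a page explaining what the exception usually means.
      def wrapConnectException(destHost: String, destPort: Int,
          e: ConnectException): ConnectException = {
        val wiki = "https://wiki.apache.org/hadoop/ConnectionRefused"
        val wrapped = new ConnectException(
          s"Call to $destHost:$destPort failed on connection exception: " +
            s"${e.getMessage}; see $wiki")
        wrapped.initCause(e)
        wrapped
      }
    }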

I'm now considering what could be done at the Kerberos layer too, though there the problem
is that the JVM exception is invariably a meaningless "Failure Unspecified at GSS API Level"
plus text which varies across JVM vendors and versions. Maybe the wiki URL should just point
to a page saying "nobody understands Kerberos, sorry".