spark-user mailing list archives

From "Sharma, Praneet" <>
Subject RE: Classloading issues when using connectors with Uber jars with improper Shading in single Spark job
Date Mon, 16 Sep 2019 02:30:51 GMT
Bumping. All inputs appreciated

From: Sharma, Praneet
Sent: Friday, August 23, 2019 11:24 AM
To: '' <>
Subject: Classloading issues when using connectors with Uber jars with improper Shading in
single Spark job

Hi Guys

When using connectors shipped as uber jars, we are hitting classloading issues in Spark 2.3.0. Upon
investigation, we found that the issues were caused by improper shading of certain classes
in these uber jars. The aim of this email is to start a discussion on whether such issues
can be mitigated or avoided from Spark core, and if so, how.

Issue Summary

We have built a Spark job using Spark version 2.3.0 which reads from Azure Cosmos DB. To read
from Cosmos DB in Spark, we rely on an uber jar provided by Azure - azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar<>.
To add this jar to the Spark driver and executor classpaths, we use the properties
spark.driver.extraClassPath and spark.executor.extraClassPath, respectively. When this Spark
job is run, we hit the following issue:

ERROR ApplicationMaster: User class threw exception: java.lang.VerifyError: Bad type on operand
Exception Details:
  Location:    org/apache/spark/metrics/sink/MetricsServlet.<init>(Ljava/util/Properties;Lcom/codahale/metrics/MetricRegistry;Lorg/apache/spark/SecurityManager;)V
@116: invokevirtual
  Reason:    Type 'com/codahale/metrics/json/MetricsModule' (current frame, stack[2]) is not
assignable to 'com/fasterxml/jackson/databind/Module'

We have done some analysis of this issue on our side, which we detail below.

Issue Details

We have a Spark 2.3.0 setup running against a Cloudera cluster in yarn-cluster mode. By default,
Spark's driver and executor classpaths include, among others, the following jars:
*       jackson-databind-2.6.5.jar
*       metrics-json-3.1.2.jar
*       spark-core_2.11-2.3.1.jar

To work with Cosmos DB, the jar we have explicitly added to the classpaths is azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar.
This is an uber jar shaded with the prefix "cosmosdb_connector_shaded", but the shading is
incomplete: some classes are left with their original fully qualified names - this is the
origin of the classloading issue mentioned above. The table below lists the classes of interest,
the jars that contain them, and the order in which Spark's MutableURLClassLoader will attempt
to load them:

Order   Jar                                              Class of Interest
1       azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar   com.codahale.metrics.json.MetricsModule
2       jackson-databind-2.6.5.jar                       com.fasterxml.jackson.databind.Module
3       metrics-json-3.1.2.jar                           com.codahale.metrics.json.MetricsModule
4       spark-core_2.11-2.3.1.jar                        org.apache.spark.metrics.sink.MetricsServlet
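
The ordering above can be illustrated with plain JDK classloading, independent of Spark. The
sketch below is ours (the temp directories and file contents are stand-ins, not from the actual
job): it shows that a URLClassLoader resolves a duplicated entry from whichever URL appears
first on its list, exactly as the uber jar shadows metrics-json here.

```scala
import java.io.File
import java.net.URLClassLoader
import java.nio.file.Files

object ClasspathOrderDemo {
  // Returns the content of the first matching "MetricsModule.class" entry,
  // given two classpath roots that both contain an entry at the same path.
  def firstMatch(): String = {
    val uberDir  = Files.createTempDirectory("uber").toFile    // stands in for the uber jar
    val plainDir = Files.createTempDirectory("metrics").toFile // stands in for metrics-json-3.1.2.jar

    def put(dir: File, content: String): Unit = {
      val f = new File(dir, "com/codahale/metrics/json/MetricsModule.class")
      f.getParentFile.mkdirs()
      Files.write(f.toPath, content.getBytes("UTF-8"))
    }
    put(uberDir, "from-uber-jar")
    put(plainDir, "from-metrics-json")

    // URLClassLoader searches its URLs in order: the first entry wins.
    val loader = new URLClassLoader(Array(uberDir.toURI.toURL, plainDir.toURI.toURL), null)
    val found  = loader.findResource("com/codahale/metrics/json/MetricsModule.class")
    new String(Files.readAllBytes(new File(found.toURI).toPath), "UTF-8")
  }

  def main(args: Array[String]): Unit =
    println(firstMatch()) // prints "from-uber-jar"
}
```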

Please note that classes from the cosmosdb-spark TPL get loaded first because this jar sits
at the top of the URL classpath pile of MutableURLClassLoader. To reiterate, the error we
see is: "Type 'com/codahale/metrics/json/MetricsModule' (current frame, stack[2]) is not assignable
to 'com/fasterxml/jackson/databind/Module'". According to our analysis, this occurs for the
following reason:

*       When the Spark driver comes up, it attempts to register an instance of the MetricsModule
class. MetricsModule is a concrete implementation of the abstract Module class, so the driver
classloader attempts to load both classes.
*       The Module class is loaded from jackson-databind-2.6.5.jar because, although the same
class is present in the uber jar, it is shaded there, so its fully qualified name differs
from the one being loaded.
*       The MetricsModule class, on the other hand, gets loaded from the uber jar. This is
because the shading missed the MetricsModule class, and because the uber jar is at the top
of the classpath pile.
*       The result is the above exception: the MetricsModule class in the uber jar is not
compatible with the Module class in jackson-databind-2.6.5.jar.
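
A quick way to confirm this kind of analysis is to ask the JVM where a class was actually
loaded from, via its CodeSource. The diagnostic sketch below is ours, not part of the job;
on the cluster one would pass com.codahale.metrics.json.MetricsModule and
com.fasterxml.jackson.databind.Module, which (per the analysis above) resolve to the uber jar
and jackson-databind respectively. Stand-in classes are used here so the sketch runs without
Spark on the classpath.

```scala
object WhichJar {
  // Returns the jar/directory a class was loaded from, or a placeholder for
  // bootstrap-loaded classes (whose CodeSource is null).
  def locationOf(cls: Class[_]): String = {
    val src = cls.getProtectionDomain.getCodeSource
    if (src == null) "<bootstrap or unknown>" else src.getLocation.toString
  }

  def main(args: Array[String]): Unit = {
    println(locationOf(Class.forName("scala.Option"))) // some scala-library location
    println(locationOf(classOf[String]))               // java.lang.String: bootstrap loader
  }
}
```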

This is one example of where the issue occurs. We have seen similar issues when we attempt
to use both the Cosmos DB and Snowflake connectors in a single Spark job. Please note that
we only use those TPLs, and those versions, which the vendors officially announce as supported
for a particular Spark version.

How could the above error have been avoided?

*       If the uber jar had been properly shaded, the MetricsModule class would have been
loaded from metrics-json-3.1.2.jar, which is compatible with the abstract Module class in
jackson-databind-2.6.5.jar.
o       From what we understand, total shading is not possible for a variety of reasons.
*       Or, if the dependencies embedded in the uber jar had been compatible with the same
dependencies present elsewhere, then even with an improperly shaded uber jar it would not
have mattered which jar the duplicate class was loaded from.
o       But this is in the vendor's control, not ours. Connectors by different vendors can
work separately in Spark, but using them together in a single job can get problematic. We
see this when attempting to use both Cosmos DB and Snowflake in a single Spark job.

Child-first Classloader as a solution for wrapping connectors in Spark?

We are able to resolve these issues on the Spark driver by wrapping load and save DataFrame
calls in their own child-first classloaders and ensuring the uber TPLs are only present in
those child-first classloaders. For example, the child-first classloader (child of MutableURLClassLoader)
wrapping Cosmos DB's load method will only have azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar
in its classpath. But we are not able to achieve the same with Spark executors because we
don't have much control there. To achieve a similar thing with the Cosmos DB executor-side
code, we would need to modify its open-source code and wrap the lambda inside mapPartitionsWithIndex
in a child-first classloader. That would mean doing this individually for each connector we
want to work with, which is not viable.
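
For reference, the child-first loader we use on the driver is essentially the following minimal
sketch (our own code, not Spark's; as far as we can tell, Spark ships a similar
org.apache.spark.util.ChildFirstURLClassLoader behind its userClassPathFirst settings):

```scala
import java.net.{URL, URLClassLoader}

// Minimal child-first classloader sketch: it tries its own URLs before
// delegating to the parent, so an uber jar placed here shadows the parent
// classpath without polluting it for other code.
class ChildFirstClassLoader(urls: Array[URL], parent: ClassLoader)
    extends URLClassLoader(urls, parent) {

  override def loadClass(name: String, resolve: Boolean): Class[_] =
    getClassLoadingLock(name).synchronized {
      val already = findLoadedClass(name)
      val cls =
        if (already != null) already
        else
          try findClass(name) // child first: look in our own URLs
          catch {
            // fall back to the normal parent-first delegation; java.* and
            // other bootstrap classes always come from the parent chain
            case _: ClassNotFoundException => super.loadClass(name, resolve)
          }
      if (resolve) resolveClass(cls)
      cls
    }
}
```

On the driver, we construct one of these with only the connector's uber jar in urls and run
the load/save call with it installed as the thread context classloader.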

Thus we reach out to you guys to figure out how such issues can be tackled. Is there a way
to achieve isolated classloading for the read and write portions of connectors in Spark? If
yes, how can that be achieved? If not, what other choices do we have? What does the Spark
community recommend we do to support the use cases mentioned here?

