From: Steve Loughran <stevel@cloudera.com>
Date: Tue, 10 Nov 2020 20:47:09 +0000
Subject: Hive isolation and context classloaders
To: Apache Spark Dev <dev@spark.apache.org>

I'm staring at https://issues.apache.org/jira/browse/HADOOP-17372 and a stack trace which claims that a com.amazonaws class doesn't implement an interface which it very much does:

2020-11-10 05:27:33,517 [ScalaTest-main-running-S3DataFrameExampleSuite] WARN  fs.FileSystem (FileSystem.java:createFileSystem(3466)) - Failed to initialize fileystem s3a://stevel-ireland: java.io.IOException: Class class com.amazonaws.auth.EnvironmentVariableCredentialsProvider does not implement AWSCredentialsProvider
- DataFrames *** FAILED ***
  org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.io.IOException: Class class com.amazonaws.auth.EnvironmentVariableCredentialsProvider does not implement AWSCredentialsProvider;
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)

This is happening because Hive wants to instantiate the cluster filesystem (full stack trace in the JIRA for the curious):
FileSystem.get(startSs.sessionConf);
The cluster FS is set to be S3; the s3a code builds up its list of credential providers via a configuration lookup:

conf.getClasses("fs.s3a.aws.credentials.provider",
  org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
  org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
  com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
  org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider)

followed by a validation that whatever was loaded can be passed into the AWS SDK:

if (!AWSCredentialsProvider.class.isAssignableFrom(credClass)) {
  throw new IOException("Class " + credClass + " " + NOT_AWS_PROVIDER);
}


What appears to be happening: the AWS credential provider is resolved through a Configuration based on the HiveConf, which uses the context classloader that was current when that conf was created, so the AWS SDK class EnvironmentVariableCredentialsProvider gets loaded in the isolated classloader. But S3AFileSystem, being org.apache.hadoop code, is loaded in the base classloader. As a result, it doesn't consider EnvironmentVariableCredentialsProvider to implement the credential provider API.
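
To see that failure mode in isolation: a minimal sketch (jar path illustrative) of how two sibling loaders produce unrelated Class objects from the same bytes:

import java.net.URL;
import java.net.URLClassLoader;

public class SplitBrainDemo {
  public static void main(String[] args) throws Exception {
    // Same jar, two unrelated loaders; a null parent means no delegation.
    URL[] cp = { new URL("file:/path/to/aws-java-sdk-core.jar") };
    ClassLoader base = new URLClassLoader(cp, null);
    ClassLoader isolated = new URLClassLoader(cp, null);

    Class<?> iface = base.loadClass(
        "com.amazonaws.auth.AWSCredentialsProvider");
    Class<?> impl = isolated.loadClass(
        "com.amazonaws.auth.EnvironmentVariableCredentialsProvider");

    // impl does implement the interface -- but its own loader's copy of
    // it, so the identity-based check the S3A code does comes out false.
    System.out.println(iface.isAssignableFrom(impl));  // false
  }
}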

What to do?

I could make this specific issue evaporate by just subclassing the AWS SDK credential providers somewhere in o.a.h.fs.s3a (sketch below) and putting those on the default list, but that leaves the issue lurking for anyone else and for every other configuration-driven extension point. Anyone who uses the plugin options for the S3A and ABFS connectors MUST use a class whose name begins org.apache.hadoop or they won't be able to init Hive.
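
The shim route would look something like this per provider (class name and package placement entirely hypothetical):

package org.apache.hadoop.fs.s3a.auth;

/**
 * Hypothetical shim: identical behaviour to the SDK provider, but the
 * name begins org.apache.hadoop, so an isolating classloader that shares
 * hadoop classes delegates it to the base loader -- where the interface
 * it implements is the same copy S3AFileSystem validates against.
 */
public class EnvironmentVariableCredentials
    extends com.amazonaws.auth.EnvironmentVariableCredentialsProvider {
}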

Alternatively, I could ignore the context classloader and make the Configuration.getClasses() method use whatever classloader loaded the actual S3AFileSystem class (roughly the sketch below). I worry that if I do that, something else will go horribly wrong somewhere completely random in the future; anything going near classloaders inevitably does, at some point.
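
A hypothetical version of that change (not an actual patch; helper name invented):

// Resolve a plugin class with the loader that loaded the connector
// itself, ignoring whatever context loader built the Configuration.
static Class<?> loadProviderClass(String name) throws IOException {
  try {
    return Class.forName(name, true,
        S3AFileSystem.class.getClassLoader());
  } catch (ClassNotFoundException e) {
    throw new IOException("Cannot load credential provider " + name, e);
  }
}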

Suggestions?