spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Devi P V <devipvina...@gmail.com>
Subject Write pyspark dataframe into kms encrypted s3 bucket
Date Thu, 15 Oct 2020 12:17:38 GMT
Hi All,

I am trying to write a pyspark dataframe into KMS encrypted S3 bucket.I am
using spark-3.0.1-bin-hadoop3.2. I have given all the possible
configurations as shown below.

sc = spark.sparkContext
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "XXX")
hadoop_conf.set("fs.s3a.secret.key","XXX")
hadoop_conf.set("fs.s3a.multipart.size", "104857600")
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.setBoolean("fs.s3a.sse.enabled",True)
hadoop_conf.set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
hadoop_conf.set("fs.s3a.sse.kms.keyId", "XXXX")



df = spark.createDataFrame(
    [
        (1, 'one'),
        (2, 'two'),
    ],
    ['id', 'txt']
)
df.write.csv('s3a://bucket_name/test_data',header='true')

Getting exception

: java.lang.IllegalArgumentException
        at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1314)
        at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1237)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:274)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
        at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:459)
        at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:559)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
        at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:953)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

Any idea to resolve this issue ?

Thanks

Mime
View raw message