spark-issues mailing list archives

From "Hyukjin Kwon (Jira)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-32766) s3a: bucket names with dots cannot be used
Date Wed, 02 Sep 2020 01:26:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-32766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-32766.
----------------------------------
    Resolution: Invalid

This doesn't look like a problem in Spark.

> s3a: bucket names with dots cannot be used
> ------------------------------------------
>
>                 Key: SPARK-32766
>                 URL: https://issues.apache.org/jira/browse/SPARK-32766
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.0.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> Running vanilla spark with
> {noformat}
> --packages=org.apache.hadoop:hadoop-aws:x.y.z{noformat}
> I cannot read from S3 if the bucket name contains a dot (which is valid in a bucket name).
> A minimal reproducible example looks like this:
> {noformat}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
>
> if __name__ == '__main__':
>     spark = (SparkSession
>         .builder
>         .appName('my_app')
>         .master("local[*]")
>         .getOrCreate()
>     )
>     spark.read.csv("s3a://test-bucket-name-v1.0/foo.csv")
> {noformat}
> Or just launch a spark-shell with `--packages=(...)hadoop-aws(...)` and read that CSV. I created the same bucket without the period and it worked fine.
> *Now I'm not sure whether this is a matter of how the path names are prepared and passed to the aws-sdk, or whether the fault is within the SDK itself. I am not Java-savvy enough to investigate the issue further, but I tried to make the repro as short as possible.*
> ----
> I get different errors depending on which Hadoop distribution I use. If I use the default PySpark distribution (which includes Hadoop 2), I get the following (using hadoop-aws:2.7.4):
> {{scala> spark.read.csv("s3a://okokes-test-v2.5/foo.csv").show()}}
> {{java.lang.IllegalArgumentException: The bucketName parameter must be specified.}}
> {{ at com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2816)}}
> {{ at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1026)}}
> {{ at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)}}
> {{ at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)}}
> {{ at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)}}
> {{ at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)}}
> {{ at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)}}
> {{ at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)}}
> {{ at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)}}
> {{ at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)}}
> {{ at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)}}
> {{ at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)}}
> {{ at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)}}
> {{ at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)}}
> {{ at scala.Option.getOrElse(Option.scala:189)}}
> {{ at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)}}
> {{ at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)}}
> {{ at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)}}
> {{ ... 47 elided}}
> When I downloaded 3.0.0 with Hadoop 3 and ran a spark-shell there, I got this error (with hadoop-aws:3.2.0):
> {{java.lang.NullPointerException: null uri host.}}
> {{ at java.base/java.util.Objects.requireNonNull(Objects.java:246)}}
> {{ at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)}}
> {{ at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)}}
> {{ at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)}}
> {{ at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)}}
> {{ at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)}}
> {{ at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)}}
> {{ at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)}}
> {{ at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)}}
> {{ at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)}}
> {{ at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)}}
> {{ at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)}}
> {{ at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)}}
> {{ at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)}}
> {{ at scala.Option.getOrElse(Option.scala:189)}}
> {{ at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)}}
> {{ at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)}}
> {{ at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)}}
> {{ ... 47 elided}}
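
One possible explanation for both traces, offered as an assumption rather than a confirmed diagnosis: java.net.URI follows the RFC 2396 hostname grammar, in which the rightmost label of a dotted hostname must start with a letter. The bucket names in the report end in a purely numeric label ("0" and "5"), so the URI host may come back as null, which would be consistent with both "The bucketName parameter must be specified" and "null uri host". The sketch below is a hypothetical pre-check (the name parses_as_uri_hostname is invented here and is not part of Spark, Hadoop, or the AWS SDK). It only approximates that grammar and ignores IPv4 literals.

```python
import re

# Hypothetical helper, not part of Spark, Hadoop, or the AWS SDK.
# Approximates the RFC 2396 "hostname" grammar: labels made of ASCII
# letters, digits, and inner hyphens, separated by dots.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def parses_as_uri_hostname(bucket: str) -> bool:
    """Return True if `bucket` looks like an RFC 2396 hostname."""
    # A single trailing dot is allowed by the grammar.
    name = bucket[:-1] if bucket.endswith(".") else bucket
    labels = name.split(".")
    if not all(_LABEL.fullmatch(label) for label in labels):
        return False
    # In a multi-label name, the rightmost label must start with a letter.
    return len(labels) == 1 or labels[-1][0].isalpha()


if __name__ == "__main__":
    for bucket in ("test-bucket-name", "test-bucket-name-v1.0", "okokes-test-v2.5"):
        print(bucket, parses_as_uri_hostname(bucket))
```

Under this approximation, both dotted bucket names from the report are rejected, while the same name without the trailing ".0" is accepted. That matches the observation that the bucket without the period worked.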



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

