spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: How to use a custom filesystem provider?
Date Wed, 21 Sep 2016 19:15:42 GMT

On 21 Sep 2016, at 20:10, Jean-Philippe Martin <jpmartin@google.com.INVALID> wrote:


The full source for my example is available on GitHub: https://github.com/jean-philippe-martin/SparkRepro

I'm using Maven to depend on gcloud-java-nio (https://mvnrepository.com/artifact/com.google.cloud/gcloud-java-nio/0.2.5),
which provides a Java FileSystem for Google Cloud Storage via "gs://" URLs. My Spark project
uses maven-shade-plugin to create one big jar with all the source in it.

The big jar correctly includes a META-INF/services/java.nio.file.spi.FileSystemProvider file,
containing the correct name for the class (com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider).
I checked, and that class is also correctly included in the jar file.

The program uses FileSystemProvider.installedProviders() to list the filesystem providers
it finds. "gs" should be listed (and it is if I run the same function in a non-Spark context),
but when running with Spark on Dataproc, that provider's gone.
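(For reference, the check is roughly the following. This is a minimal sketch; the class name is just for illustration.)

import java.nio.file.spi.FileSystemProvider;

public class ListProviders {
    public static void main(String[] args) {
        // Print the scheme and implementing class of every installed
        // NIO filesystem provider. If the META-INF/services entry was
        // picked up, "gs" should appear in this list.
        for (FileSystemProvider p : FileSystemProvider.installedProviders()) {
            System.out.println(p.getScheme() + " -> " + p.getClass().getName());
        }
    }
}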

I'd like to know: How can I use a custom filesystem in my Spark program?



There's a bit of confusion setting in here: the FileSystem implementations Spark uses are
subclasses of org.apache.hadoop.fs.FileSystem; the java.nio class of the same name is a
different API. Spark resolves its paths through the Hadoop API, so a java.nio provider
registered in your jar won't be used for Spark I/O.
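As a minimal sketch of the API Spark actually goes through, assuming the GCS connector
linked below is on the classpath and with "my-bucket" standing in for a real bucket:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GcsViaHadoopFs {
    public static void main(String[] args) throws Exception {
        // Spark resolves URIs through org.apache.hadoop.fs.FileSystem;
        // with the connector on the classpath, the "gs" scheme maps to
        // its Hadoop implementation. "my-bucket" is a placeholder.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("gs://my-bucket/"), conf);
        for (FileStatus status : fs.listStatus(new Path("gs://my-bucket/"))) {
            System.out.println(status.getPath());
        }
    }
}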

Grab the Google Cloud Storage connector and put it on your classpath:

https://cloud.google.com/hadoop/google-cloud-storage-connector
https://github.com/GoogleCloudPlatform/bigdata-interop
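Once the connector jar is on the driver and executor classpath, gs:// paths resolve through
that Hadoop layer; a rough sketch (bucket and file names are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadFromGcs {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ReadFromGcs");
        // JavaSparkContext is Closeable, so try-with-resources stops it.
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Reading gs:// goes through the Hadoop FileSystem the
            // connector registers, not through java.nio.
            JavaRDD<String> lines = sc.textFile("gs://my-bucket/input.txt");
            System.out.println("line count: " + lines.count());
        }
    }
}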

