spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Role-based S3 access outside of EMR
Date Sun, 14 Aug 2016 20:15:24 GMT

On 29 Jul 2016, at 00:07, Everett Anderson <everett@nuna.com.INVALID> wrote:

Hey,

Just wrapping this up --

I ended up following the instructions (https://spark.apache.org/docs/1.6.2/building-spark.html)
to build a custom Spark release with Hadoop 2.7.2, stealing from Steve's SPARK-7481 PR a bit,
in order to get Spark 1.6.2 + Hadoop 2.7.2 + the hadoop-aws library (which pulls in the proper
AWS Java SDK dependency).

Now that there's an official Spark 2.0 + Hadoop 2.7.x release, this is probably no longer
necessary, but I haven't tried it yet.


You still need to get the hadoop-aws JAR and compatible AWS SDK JARs into your lib dir; the
SPARK-7481 patch does that and gets the hadoop-aws classes into the spark-assembly JAR,
something which isn't directly relevant for Spark 2.
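
One way to do that at launch time, as a sketch (assuming a Hadoop 2.7.2-based Spark build; the
hadoop-aws version must match your Hadoop version), is Spark's --packages mechanism:

    # Sketch: pull hadoop-aws and its AWS SDK dependency from Maven Central.
    # The 2.7.2 version here assumes a Hadoop 2.7.2-based Spark build.
    spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.2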

The PR is still tagged as WiP pending the release of Hadoop 2.7.3, which will swallow classload
exceptions when enumerating filesystem clients declared in JARs ... without that fix, the
presence of hadoop-aws or hadoop-azure on the classpath *without the matching Amazon or Azure
JARs* will cause startup to fail.


With the custom release, s3a paths work fine with EC2 role credentials without doing anything
special. The only thing I had to do was add this extra --conf flag to spark-submit in order to
write to encrypted S3 buckets --

    --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256
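
In context, a full invocation might look like this sketch (the class, JAR, and bucket names
are hypothetical):

    # Minimal sketch; the class, JAR, and bucket names are made up.
    spark-submit \
      --class com.example.MyJob \
      --master yarn \
      --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256 \
      my-job.jar s3a://my-encrypted-bucket/output/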


I'd really like to know what performance difference you see when working with server-side
encryption and different file formats; can you run any tests using encrypted and unencrypted
copies of the same datasets and see how the times come out?


Full instructions for building on Mac are here:

1) Download the Spark 1.6.2 source from https://spark.apache.org/downloads.html
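
For example (the archive.apache.org URL is my assumption for where the 1.6.2 source lives):

    # Fetch and unpack the Spark 1.6.2 source release.
    curl -O https://archive.apache.org/dist/spark/spark-1.6.2/spark-1.6.2.tgz
    tar xzf spark-1.6.2.tgz
    cd spark-1.6.2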

2) Install R

brew tap homebrew/science
brew install r

3) Set JAVA_HOME and the MAVEN_OPTS as in the instructions
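
Roughly like this (the MAVEN_OPTS values follow the Spark 1.6 build docs; the java_home call
is the usual macOS idiom):

    # Point JAVA_HOME at the JDK and give Maven enough memory for the build.
    export JAVA_HOME=$(/usr/libexec/java_home)
    export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"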

4) Modify the root pom.xml to add a hadoop-2.7 profile (mostly stolen from Spark 2.0)

    <profile>
      <id>hadoop-2.7</id>
      <properties>
        <hadoop.version>2.7.2</hadoop.version>
        <jets3t.version>0.9.3</jets3t.version>
        <zookeeper.version>3.4.6</zookeeper.version>
        <curator.version>2.6.0</curator.version>
      </properties>
      <dependencyManagement>
        <dependencies>
          <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-aws</artifactId>
            <version>${hadoop.version}</version>
            <scope>${hadoop.deps.scope}</scope>
            <exclusions>
              <exclusion>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
              </exclusion>
              <exclusion>
                <groupId>commons-logging</groupId>
                <artifactId>commons-logging</artifactId>
              </exclusion>
            </exclusions>
          </dependency>
        </dependencies>
      </dependencyManagement>
    </profile>

5) Modify core/pom.xml to include the corresponding hadoop-aws and AWS SDK libs

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
        </exclusion>
        <exclusion>
          <groupId>commons-logging</groupId>
          <artifactId>commons-logging</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

6) Build with

./make-distribution.sh --name custom-hadoop-2.7-2-aws-s3a --tgz \
  -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
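
As a sanity check that the s3a classes actually made it into the assembly (the tarball and JAR
paths below follow from the --name flag and the usual Spark 1.x layout; treat them as
assumptions):

    # Confirm S3AFileSystem is inside the assembly JAR.
    tar xzf spark-1.6.2-bin-custom-hadoop-2.7-2-aws-s3a.tgz
    unzip -l spark-1.6.2-bin-custom-hadoop-2.7-2-aws-s3a/lib/spark-assembly-*.jar | grep S3AFileSystem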



