spark-dev mailing list archives

From: Jey Kottalam <...@cs.berkeley.edu>
Subject: Re: Important: Changes to Spark's build system on master branch
Date: Wed, 21 Aug 2013 20:20:05 GMT
As Mridul points out, the old "hadoop1" and "hadoop2" terminology
referred to the versions of certain interfaces and classes within
Hadoop. With these latest changes we have unified the handling of both
hadoop1 and hadoop2 interfaces so that the build is agnostic to the
exact Hadoop version available at runtime.

However, the distinction between YARN-enabled and non-YARN builds does
still exist. I propose that we retroactively reinterpret
"hadoop2-yarn" as shorthand for "Hadoop MapReduce v2 (aka YARN)".

-Jey

On Wed, Aug 21, 2013 at 1:04 PM, Mridul Muralidharan <mridul@gmail.com> wrote:
> hadoop2, in this context, means running Spark on a Hadoop cluster without
> YARN but with the hadoop2 interfaces.
> hadoop2-yarn uses the YARN RM to launch a Spark job (and obviously uses
> the hadoop2 interfaces).
>
> Regards,
> Mridul
>
> On Wed, Aug 21, 2013 at 11:52 PM, Konstantin Boudnik <cos@apache.org> wrote:
>> For what it's worth, guys, the hadoop2 profile's content is misleading: CDH
>> isn't Hadoop2; it has 1354 patches on top of the Hadoop2 alpha.
>>
>> What is called hadoop2-yarn is actually hadoop2. Perhaps, while we are at
>> it, the profiles should be renamed. I can supply the patch if the community
>> is OK with it.
>>
>> Cos
>>
>> On Tue, Aug 20, 2013 at 11:36 PM, Andy Konwinski wrote:
>>> Hey Jey,
>>>
>>> I'd just like to add that you can also build against hadoop2 without
>>> modifying the pom.xml file, by passing the hadoop.version property on the
>>> command line like this:
>>>
>>> mvn -Dhadoop.version=2.0.0-mr1-cdh4.1.2 clean verify
>>>
>>> Also, in your instructions on building with Maven, I think you forgot to
>>> finish writing out your example for activating the yarn profile, which I
>>> think would be something like:
>>>
>>> mvn -Phadoop2-yarn clean verify
>>>
>>> ...right?
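>>>
>>> And presumably the profile and the property flags can be combined as
>>> well, something like this (the version strings here are just a guess on
>>> my part):
>>>
>>> mvn -Phadoop2-yarn -Dhadoop.version=2.0.5-alpha -Dyarn.version=2.0.5-alpha clean verify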
>>>
>>> BTW, I've set up the AMPLab Jenkins Spark Maven Hadoop2 project to build
>>> using the new options:
>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-Hadoop2/
>>>
>>> Andy
>>>
>>> On Tue, Aug 20, 2013 at 8:39 PM, Jey Kottalam <jey@cs.berkeley.edu> wrote:
>>>
>>> > The master branch of Spark has been updated with PR #838, which
>>> > changes aspects of Spark's interface to Hadoop. This also involved
>>> > making changes to Spark's build system, as documented below. The
>>> > documentation will be updated with this information shortly.
>>> >
>>> > Please feel free to reply to this thread with any questions or if you
>>> > encounter any problems.
>>> >
>>> > -Jey
>>> >
>>> >
>>> >
>>> > When Building Spark
>>> > ===============
>>> >
>>> > - General: The default version of Hadoop has been updated from 1.0.4
>>> > to 1.2.1.
>>> >
>>> > - General: You will probably need to perform an "sbt clean" or "mvn
>>> > clean" to remove old build files. SBT users may also need to perform a
>>> > "clean" when changing Hadoop versions (or at least delete the
>>> > lib_managed directory).
>>> >
>>> > - SBT users: The version of Hadoop used can be specified by setting
>>> > the SPARK_HADOOP_VERSION environment variable when invoking sbt, and
>>> > YARN-enabled builds can be created by setting SPARK_WITH_YARN=true.
>>> > Example:
>>> >
>>> >     # Using Hadoop 1.1.0 (a version of Hadoop without YARN)
>>> >     SPARK_HADOOP_VERSION=1.1.0 ./sbt/sbt package assembly
>>> >
>>> >     # Using Hadoop 2.0.5-alpha (which is a YARN-based version of Hadoop)
>>> >     SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_WITH_YARN=true \
>>> >       ./sbt/sbt package assembly
>>> >
>>> > - Maven users: Set the Hadoop version to build against by editing the
>>> > "pom.xml" file in the root directory and changing the "hadoop.version"
>>> > property (and the "yarn.version" property, if applicable). If you are
>>> > building with YARN disabled, you no longer need to enable any Maven
>>> > profiles (i.e. "-P" flags). To build with YARN enabled, use the
>>> > "hadoop2-yarn" Maven profile. Example:
>>> >
>>> > - The "make-distribution.sh" script has been updated to take
>>> > additional parameters to select the Hadoop version and enable YARN.
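>>> >
>>> >   For instance, a YARN-enabled distribution against Hadoop 2.0.5-alpha
>>> >   might be built with something like the following (the flag names
>>> >   shown here are a sketch; check the script itself for the exact
>>> >   options it accepts):
>>> >
>>> >     ./make-distribution.sh --hadoop 2.0.5-alpha --with-yarn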
>>> >
>>> >
>>> >
>>> > When Writing Spark Applications
>>> > ========================
>>> >
>>> >
>>> > - Non-YARN users: If you wish to use HDFS, you will need to add the
>>> > appropriate version of the "hadoop-client" artifact from the
>>> > "org.apache.hadoop" group to your project.
>>> >
>>> >     SBT example:
>>> >         // "force()" is required because "1.1.0" is less than Spark's
>>> > default of "1.2.1"
>>> >         "org.apache.hadoop" % "hadoop-client" % "1.1.0" force()
>>> >
>>> >     Maven example:
>>> >         <dependency>
>>> >           <groupId>org.apache.hadoop</groupId>
>>> >           <artifactId>hadoop-client</artifactId>
>>> >           <!-- the brackets are needed to tell Maven that this is
>>> >                a hard dependency on version "1.1.0" exactly -->
>>> >           <version>[1.1.0]</version>
>>> >         </dependency>
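>>> >
>>> >     If you're not sure which version string matches your cluster, one
>>> >     way to check is to run the following on a cluster node (it prints
>>> >     the installed Hadoop version, e.g. "Hadoop 1.1.0"):
>>> >
>>> >         hadoop version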
>>> >
>>> >
>>> > - YARN users: You will now need to set SPARK_JAR to point to the
>>> > spark-yarn assembly instead of the spark-core assembly previously
>>> > used.
>>> >
>>> >   SBT example:
>>> >       SPARK_JAR=$PWD/yarn/target/spark-yarn-assembly-0.8.0-SNAPSHOT.jar \
>>> >         ./run spark.deploy.yarn.Client \
>>> >           --jar $PWD/examples/target/scala-2.9.3/spark-examples_2.9.3-0.8.0-SNAPSHOT.jar \
>>> >           --class spark.examples.SparkPi --args yarn-standalone \
>>> >           --num-workers 3 --worker-memory 2g --master-memory 2g \
>>> >           --worker-cores 1
>>> >
