spark-dev mailing list archives

From Konstantin Boudnik <...@apache.org>
Subject Re: Important: Changes to Spark's build system on master branch
Date Wed, 21 Aug 2013 23:50:04 GMT
I hear you guys - and I am well aware of the differences between the two.
However, actual Hadoop2 doesn't even have such a thing as MR1 - this is why
the profile naming is misleading. What you see under the current 'hadoop2'
profile is essentially a commercial hack that doesn't exist anywhere beyond
CDH artifacts (and even there not for long).

Besides, YARN != MR2 :) YARN is a resource manager that, among other things,
allows MR applications to run on it.

We can argue about semantics until we're blue in the face, but the reality is
simple: the current 'hadoop2' profile doesn't reflect what Hadoop2 actually
is. That's my only point.

Cos

On Wed, Aug 21, 2013 at 01:20PM, Jey Kottalam wrote:
> As Mridul points out, the old "hadoop1" and "hadoop2" terminology
> referred to the versions of certain interfaces and classes within
> Hadoop. With these latest changes we have unified the handling of both
> hadoop1 and hadoop2 interfaces so that the build is agnostic to the
> exact Hadoop version available at runtime.
> 
> However, the distinction between YARN-enabled and non-YARN builds does
> still exist. I propose that we retroactively reinterpret
> "hadoop2-yarn" as shorthand for "Hadoop MapReduce v2 (aka YARN)".
> 
> -Jey
> 
> On Wed, Aug 21, 2013 at 1:04 PM, Mridul Muralidharan <mridul@gmail.com> wrote:
> > hadoop2, in this context, means using Spark on a Hadoop cluster without
> > YARN but with hadoop2 interfaces.
> > hadoop2-yarn uses the YARN RM to launch a Spark job (and obviously uses
> > hadoop2 interfaces).
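> >
> > Concretely, the difference at build time looks something like this (both
> > commands appear elsewhere in this thread; the versions are illustrative):
> >
> >     # hadoop2 interfaces without YARN - no profile, just the version:
> >     mvn -Dhadoop.version=2.0.0-mr1-cdh4.1.2 clean verify
> >
> >     # hadoop2 interfaces with YARN - activate the hadoop2-yarn profile:
> >     mvn -Phadoop2-yarn clean verify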
> >
> > Regards,
> > Mridul
> >
> > On Wed, Aug 21, 2013 at 11:52 PM, Konstantin Boudnik <cos@apache.org> wrote:
> >> For what it's worth, guys - the hadoop2 profile content is misleading: CDH
> >> isn't Hadoop2: it has 1354 patches on top of the Hadoop2 alpha.
> >>
> >> What is called hadoop2-yarn is actually hadoop2. Perhaps, while we are at
> >> it, the profiles need to be renamed. I can supply the patch if the
> >> community is ok with it.
> >>
> >> Cos
> >>
> >> On Tue, Aug 20, 2013 at 11:36PM, Andy Konwinski wrote:
> >>> Hey Jey,
> >>>
> >>> I'd just like to add that you can also build against hadoop2 without
> >>> modifying the pom.xml file by passing the hadoop.version property at the
> >>> command line like this:
> >>>
> >>> mvn -Dhadoop.version=2.0.0-mr1-cdh4.1.2 clean verify
> >>>
> >>> Also, when you mentioned building with Maven in your instructions, I think
> >>> you forgot to finish writing out your example for activating the yarn
> >>> profile, which I think would be something like:
> >>>
> >>> mvn -Phadoop2-yarn clean verify
> >>>
> >>> ...right?
> >>>
> >>> BTW, I've set up the AMPLab Jenkins Spark Maven Hadoop2 project to build
> >>> using the new options
> >>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-Hadoop2/
> >>>
> >>> Andy
> >>>
> >>> On Tue, Aug 20, 2013 at 8:39 PM, Jey Kottalam <jey@cs.berkeley.edu> wrote:
> >>>
> >>> > The master branch of Spark has been updated with PR #838, which
> >>> > changes aspects of Spark's interface to Hadoop. This also involved
> >>> > making changes to Spark's build system, as documented below. The
> >>> > documentation will be updated with this information shortly.
> >>> >
> >>> > Please feel free to reply to this thread with any questions or if you
> >>> > encounter any problems.
> >>> >
> >>> > -Jey
> >>> >
> >>> >
> >>> >
> >>> > When Building Spark
> >>> > ===============
> >>> >
> >>> > - General: The default version of Hadoop has been updated to 1.2.1
> >>> > from 1.0.4.
> >>> >
> >>> > - General: You will probably need to perform an "sbt clean" or "mvn
> >>> > clean" to remove old build files. SBT users may also need to perform a
> >>> > "clean" when changing Hadoop versions (or at least delete the
> >>> > lib_managed directory).
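> >>> >
> >>> >   For example, a conservative way to switch Hadoop versions with SBT
> >>> >   might look like this (a sketch; the environment variables are
> >>> >   described in the next item):
> >>> >
> >>> >     ./sbt/sbt clean      # remove old build files
> >>> >     rm -rf lib_managed   # or at minimum drop the cached Hadoop jars
> >>> >     SPARK_HADOOP_VERSION=1.1.0 ./sbt/sbt package assembly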
> >>> >
> >>> > - SBT users: The version of Hadoop used can be specified by setting
> >>> > the SPARK_HADOOP_VERSION environment variable when invoking sbt, and
> >>> > YARN-enabled builds can be created by setting SPARK_WITH_YARN=true.
> >>> > Example:
> >>> >
> >>> >     # Using Hadoop 1.1.0 (a version of Hadoop without YARN)
> >>> >     SPARK_HADOOP_VERSION=1.1.0 ./sbt/sbt package assembly
> >>> >
> >>> >     # Using Hadoop 2.0.5-alpha (which is a YARN-based version of Hadoop)
> >>> >     SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_WITH_YARN=true ./sbt/sbt
> >>> > package assembly
> >>> >
> >>> > - Maven users: Set the Hadoop version built against by editing the
> >>> > "pom.xml" file in the root directory and changing the "hadoop.version"
> >>> > property (and the "yarn.version" property, if applicable). If you are
> >>> > building with YARN disabled, you no longer need to enable any Maven
> >>> > profiles (i.e. "-P" flags). To build with YARN enabled, use the
> >>> > "hadoop2-yarn" Maven profile. Example:
> >>> >
> >>> > - The "make-distribution.sh" script has been updated to take
> >>> > additional parameters to select the Hadoop version and enable YARN.
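> >>> >
> >>> >   For example, something like the following (the flag names here are a
> >>> >   guess - check the script's usage header for the exact spelling):
> >>> >
> >>> >     ./make-distribution.sh --hadoop 2.0.5-alpha --with-yarn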
> >>> >
> >>> >
> >>> >
> >>> > When Writing Spark Applications
> >>> > ========================
> >>> >
> >>> >
> >>> > - Non-YARN users: If you wish to use HDFS, you will need to add the
> >>> > appropriate version of the "hadoop-client" artifact from the
> >>> > "org.apache.hadoop" group to your project.
> >>> >
> >>> >     SBT example:
> >>> >         // "force()" is required because "1.1.0" is less than Spark's
> >>> > default of "1.2.1"
> >>> >         "org.apache.hadoop" % "hadoop-client" % "1.1.0" force()
> >>> >
> >>> >     Maven example:
> >>> >         <dependency>
> >>> >           <groupId>org.apache.hadoop</groupId>
> >>> >           <artifactId>hadoop-client</artifactId>
> >>> >           <!-- the brackets are needed to tell Maven that this is a
> >>> >                hard dependency on version "1.1.0" exactly -->
> >>> >           <version>[1.1.0]</version>
> >>> >         </dependency>
> >>> >
> >>> >
> >>> > - YARN users: You will now need to set SPARK_JAR to point to the
> >>> > spark-yarn assembly instead of the spark-core assembly previously
> >>> > used.
> >>> >
> >>> >   SBT Example:
> >>> >        SPARK_JAR=$PWD/yarn/target/spark-yarn-assembly-0.8.0-SNAPSHOT.jar \
> >>> >         ./run spark.deploy.yarn.Client \
> >>> >           --jar $PWD/examples/target/scala-2.9.3/spark-examples_2.9.3-0.8.0-SNAPSHOT.jar \
> >>> >           --class spark.examples.SparkPi --args yarn-standalone \
> >>> >           --num-workers 3 --worker-memory 2g --master-memory 2g \
> >>> >           --worker-cores 1
> >>> >
