Hi all,

I never heard back from anyone on this, and I have received private emails saying people would like to add terasort to their spark-perf installs so it becomes part of their cluster validation checks.

Yours,
Ewan

-------- Forwarded Message --------
Subject: Spark-perf terasort WIP branch
Date: Wed, 14 Jan 2015 14:33:45 +0100
From: Ewan Higgs
To: dev@spark.apache.org

Hi all,
I'm trying to build the Spark-perf WIP code, but there are some errors to do with the Hadoop APIs. I presume this is because some Hadoop version is set somewhere and the build is referring to that, but I can't seem to find where. The errors are as follows:

[info] Compiling 15 Scala sources and 2 Java sources to /home/ehiggs/src/spark-perf/spark-tests/target/scala-2.10/classes...
[error] /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/spark/perf/terasort/TeraInputFormat.scala:40: object task is not a member of package org.apache.hadoop.mapreduce
[error] import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
[error]        ^
[error] /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/spark/perf/terasort/TeraInputFormat.scala:132: not found: type TaskAttemptContextImpl
[error]     val context = new TaskAttemptContextImpl(
[error]                       ^
[error] /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/spark/perf/terasort/TeraScheduler.scala:37: object TTConfig is not a member of package org.apache.hadoop.mapreduce.server.tasktracker
[error] import org.apache.hadoop.mapreduce.server.tasktracker.TTConfig
[error]        ^
[error] /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/spark/perf/terasort/TeraScheduler.scala:91: not found: value TTConfig
[error]     var slotsPerHost : Int = conf.getInt(TTConfig.TT_MAP_SLOTS, 4)
[error]                                          ^
[error] /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/spark/perf/terasort/TeraSortAll.scala:7: value run is not a member of org.apache.spark.examples.terasort.TeraGen
[error]     tg.run(Array[String]("10M", "/tmp/terasort_in"))
[error]        ^
[error] /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/spark/perf/terasort/TeraSortAll.scala:9: value run is not a member of org.apache.spark.examples.terasort.TeraSort
[error]     ts.run(Array[String]("/tmp/terasort_in", "/tmp/terasort_out"))
[error]        ^
[error] 6 errors found
[error] (compile:compile) Compilation failed
[error] Total time: 13 s, completed 05-Jan-2015 12:21:47

I can build the same code if it's in the Spark tree using the following command:

mvn -Dhadoop.version=2.5.0 -DskipTests=true install

Is there a way I can convince spark-perf to build this code with the appropriate Hadoop library version? I tried to apply the following to spark-tests/project/SparkTestsBuild.scala but it didn't seem to work as I expected:

$ git diff project/SparkTestsBuild.scala
diff --git a/spark-tests/project/SparkTestsBuild.scala b/spark-tests/project/SparkTestsBuild.scala
index 4116326..4ed5f0c 100644
--- a/spark-tests/project/SparkTestsBuild.scala
+++ b/spark-tests/project/SparkTestsBuild.scala
@@ -16,7 +16,9 @@ object SparkTestsBuild extends Build {
         "org.scalatest" %% "scalatest" % "2.2.1" % "test",
         "com.google.guava" % "guava" % "14.0.1",
         "org.apache.spark" %% "spark-core" % "1.0.0" % "provided",
-        "org.json4s" %% "json4s-native" % "3.2.9"
+        "org.json4s" %% "json4s-native" % "3.2.9",
+        "org.apache.hadoop" % "hadoop-common" % "2.5.0",
+        "org.apache.hadoop" % "hadoop-mapreduce" % "2.5.0"
       ),
       test in assembly := {},
       outputPath in assembly := file("target/spark-perf-tests-assembly.jar"),
@@ -36,4 +38,4 @@ object SparkTestsBuild extends Build {
       case _ => MergeStrategy.first
     }
   ))
-}
\ No newline at end of file
+}

Yours,
Ewan
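[Editor's note, not part of the original thread: one likely reason the diff above does not resolve the errors is that there is no single "hadoop-mapreduce" artifact; in Hadoop 2.x the classes the compiler cannot find (org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl and org.apache.hadoop.mapreduce.server.tasktracker.TTConfig) ship in hadoop-mapreduce-client-core. A sketch of how the dependency block might look instead, with the Hadoop version read from a system property so it can be overridden on the sbt command line; the surrounding build file is assumed to match the diff above.]

```scala
// Sketch only (build configuration fragment, assumes Hadoop 2.x artifact layout).
// Override the default with e.g.:  sbt -Dhadoop.version=2.5.0 assembly
val hadoopVersion = sys.props.getOrElse("hadoop.version", "2.5.0")

libraryDependencies ++= Seq(
  // hadoop-common: core FileSystem/Configuration classes
  "org.apache.hadoop" % "hadoop-common" % hadoopVersion % "provided",
  // hadoop-mapreduce-client-core: contains TaskAttemptContextImpl and TTConfig
  "org.apache.hadoop" % "hadoop-mapreduce-client-core" % hadoopVersion % "provided"
)
```

Marking the Hadoop jars "provided" keeps them out of the assembly jar, matching how the build already treats spark-core.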