spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Issues when combining Spark and a third party java library
Date Mon, 26 Jan 2015 15:08:05 GMT
It looks more like Spark is not able to find the Hadoop jars. Try setting
HADOOP_CONF_DIR, and also make sure the *-site.xml files are available on the
CLASSPATH/SPARK_CLASSPATH.
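
One way to check this from inside the same Maven/Eclipse project is a small
diagnostic class like the sketch below. It is only an illustration, not part
of Spark or Hadoop: the class name HadoopClasspathCheck and its helper methods
are made up for this example, and core-site.xml / hdfs-site.xml are assumed to
be the *-site.xml files of interest. The two Hadoop class names are taken from
the log in the quoted message. It prints HADOOP_CONF_DIR, whether the config
files are visible on the classpath, and which jar (if any) each Hadoop class is
loaded from.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical diagnostic helper; the name and methods are illustrative only.
public class HadoopClasspathCheck {

    /** Returns the jar or directory a class is loaded from, or "NOT FOUND". */
    static String locateClass(String className) {
        try {
            // 'false' avoids running static initializers of the probed class.
            Class<?> cls = Class.forName(className, false,
                    HadoopClasspathCheck.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            // JDK classes typically have no code source.
            return (src == null || src.getLocation() == null)
                    ? "(JDK/bootstrap)"
                    : src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "NOT FOUND";
        }
    }

    /** Returns the URL of a classpath resource, or "NOT FOUND". */
    static String locateResource(String name) {
        URL url = HadoopClasspathCheck.class.getClassLoader().getResource(name);
        return url == null ? "NOT FOUND" : url.toString();
    }

    public static void main(String[] args) {
        System.out.println("HADOOP_CONF_DIR = " + System.getenv("HADOOP_CONF_DIR"));

        // Assumed examples of the *-site.xml files mentioned above.
        for (String res : new String[] {"core-site.xml", "hdfs-site.xml"}) {
            System.out.println(res + " -> " + locateResource(res));
        }

        // Class names copied from the log in the quoted message.
        for (String cls : new String[] {
                "org.apache.hadoop.fs.FileSystem",
                "org.apache.hadoop.mapred.InputSplitWithLocationInfo"}) {
            System.out.println(cls + " -> " + locateClass(cls));
        }
    }
}
```

Running it once with and once without the CDK dependency in the pom.xml and
comparing the printed jar locations may show whether a different Hadoop jar
ends up on the classpath.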

Thanks
Best Regards

On Mon, Jan 26, 2015 at 7:28 PM, Staffan <staffan.arvidsson@gmail.com>
wrote:

> I'm using Maven and Eclipse to build my project. I'm letting Maven download
> everything I need to run it, which has worked fine up until now. I need to
> use the CDK library (https://github.com/egonw/cdk,
> http://sourceforge.net/projects/cdk/), and once I add its dependencies to my
> pom.xml, Spark starts to complain. This happens without calling any function
> or importing any new library into my code, only by introducing the new
> dependencies to the pom.xml. Trying to set up a SparkContext gives me the
> following errors in the log:
>
> [main] DEBUG org.apache.spark.rdd.HadoopRDD - SplitLocationInfo and other new Hadoop classes are unavailable. Using the older Hadoop location info code.
> java.lang.ClassNotFoundException: org.apache.hadoop.mapred.InputSplitWithLocationInfo
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:191)
> at org.apache.spark.rdd.HadoopRDD$SplitInfoReflections.<init>(HadoopRDD.scala:381)
> at org.apache.spark.rdd.HadoopRDD$.liftedTree1$1(HadoopRDD.scala:391)
> at org.apache.spark.rdd.HadoopRDD$.<init>(HadoopRDD.scala:390)
> at org.apache.spark.rdd.HadoopRDD$.<clinit>(HadoopRDD.scala)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:159)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
> at org.apache.spark.rdd.RDD.foreach(RDD.scala:765)
>
> Later in the log:
> [Executor task launch worker-0] DEBUG org.apache.spark.deploy.SparkHadoopUtil - Couldn't find method for retrieving thread-level FileSystem input data
> java.lang.NoSuchMethodException: org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()
> at java.lang.Class.getDeclaredMethod(Class.java:2009)
> at org.apache.spark.util.Utils$.invoke(Utils.scala:1733)
> at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178)
> at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178)
> at org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:138)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:220)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> There have also been issues related to "HADOOP_HOME" not being set etc., but
> these seem to be intermittent.
>
>
> After testing different versions of both CDK and Spark, I've found that
> Spark 0.9.1 and earlier DO NOT have this problem, so something in the newer
> versions of Spark does not play well with others... However, I need the
> functionality in the later versions of Spark, so downgrading does not solve
> my problem. Anyone willing to try to reproduce the issue can do so by adding
> the dependency for CDK:
>
> <dependency>
> <groupId>org.openscience.cdk</groupId>
> <artifactId>cdk-fingerprint</artifactId>
> <version>1.5.10</version>
> </dependency>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-when-combining-Spark-and-a-third-party-java-library-tp21367.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
