spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Guava 11 dependency issue in Spark 1.2.0
Date Mon, 19 Jan 2015 14:18:58 GMT
Please see this thread:

http://search-hadoop.com/m/LgpTk2aVYgr/Hadoop+guava+upgrade&subj=Re+Time+to+address+the+Guava+version+problem


> On Jan 19, 2015, at 6:03 AM, Romi Kuntsman <romi@totango.com> wrote:
> 
> I have recently encountered a similar problem with Guava version collision with Hadoop.
> 
> Wouldn't it be more correct to upgrade Hadoop to use the latest Guava? Why are they staying on version 11? Does anyone know?
> 
> Romi Kuntsman, Big Data Engineer
> http://www.totango.com
> 
>> On Wed, Jan 7, 2015 at 7:59 AM, Niranda Perera <niranda.perera@gmail.com> wrote:
>> Hi Sean, 
>> 
>> I removed the Hadoop dependencies from the app and ran it on the cluster. It gives a java.io.EOFException:
>> 
>> 15/01/07 11:19:29 INFO MemoryStore: ensureFreeSpace(177166) called with curMem=0, maxMem=2004174766
>> 15/01/07 11:19:29 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 173.0 KB, free 1911.2 MB)
>> 15/01/07 11:19:29 INFO MemoryStore: ensureFreeSpace(25502) called with curMem=177166, maxMem=2004174766
>> 15/01/07 11:19:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.9 KB, free 1911.1 MB)
>> 15/01/07 11:19:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.100.5.109:43924 (size: 24.9 KB, free: 1911.3 MB)
>> 15/01/07 11:19:29 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
>> 15/01/07 11:19:29 INFO SparkContext: Created broadcast 0 from hadoopFile at AvroRelation.scala:45
>> 15/01/07 11:19:29 INFO FileInputFormat: Total input paths to process : 1
>> 15/01/07 11:19:29 INFO SparkContext: Starting job: collect at SparkPlan.scala:84
>> 15/01/07 11:19:29 INFO DAGScheduler: Got job 0 (collect at SparkPlan.scala:84) with 2 output partitions (allowLocal=false)
>> 15/01/07 11:19:29 INFO DAGScheduler: Final stage: Stage 0(collect at SparkPlan.scala:84)
>> 15/01/07 11:19:29 INFO DAGScheduler: Parents of final stage: List()
>> 15/01/07 11:19:29 INFO DAGScheduler: Missing parents: List()
>> 15/01/07 11:19:29 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[6] at map at SparkPlan.scala:84), which has no missing parents
>> 15/01/07 11:19:29 INFO MemoryStore: ensureFreeSpace(4864) called with curMem=202668, maxMem=2004174766
>> 15/01/07 11:19:29 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.8 KB, free 1911.1 MB)
>> 15/01/07 11:19:29 INFO MemoryStore: ensureFreeSpace(3481) called with curMem=207532, maxMem=2004174766
>> 15/01/07 11:19:29 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.4 KB, free 1911.1 MB)
>> 15/01/07 11:19:29 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.100.5.109:43924 (size: 3.4 KB, free: 1911.3 MB)
>> 15/01/07 11:19:29 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
>> 15/01/07 11:19:29 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
>> 15/01/07 11:19:29 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[6] at map at SparkPlan.scala:84)
>> 15/01/07 11:19:29 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
>> 15/01/07 11:19:29 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.100.5.109, PROCESS_LOCAL, 1340 bytes)
>> 15/01/07 11:19:29 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.100.5.109, PROCESS_LOCAL, 1340 bytes)
>> 15/01/07 11:19:29 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 10.100.5.109): java.io.EOFException
>>     at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2722)
>>     at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1009)
>>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>     at org.apache.hadoop.io.UTF8.readChars(UTF8.java:216)
>>     at org.apache.hadoop.io.UTF8.readString(UTF8.java:208)
>>     at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
>>     at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:237)
>>     at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
>>     at org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:43)
>>     at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:985)
>>     at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1871)
>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
>>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
>>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
>>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
>>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>>     at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>>     at java.lang.Thread.run(Thread.java:662)
>>  
>> 
>> I'm running the program from the IDE, not via spark-submit. Can we not submit an app straight from the IDE to the Spark cluster?
>> 
>> Cheers
>> 
>>> On Tue, Jan 6, 2015 at 3:53 PM, Sean Owen <sowen@cloudera.com> wrote:
>>> Oh, are you actually bundling Hadoop in your app? That may be the problem. If you're using stand-alone mode, why include Hadoop? In any event, Spark and Hadoop are intended to be 'provided' dependencies in the app you send to spark-submit.
>>> 
>>>> On Tue, Jan 6, 2015 at 10:15 AM, Niranda Perera <niranda.perera@gmail.com> wrote:
>>>> Hi Sean, 
>>>> 
>>>> My mistake: the Guava 11 dependency did indeed come from hadoop-commons.
>>>> 
>>>> I'm running the following simple app on a Spark 1.2.0 standalone local cluster (2 workers) with Hadoop 1.2.1:
>>>> 
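>>>> import java.util.List;
>>>> 
>>>> import org.apache.spark.SparkConf;
>>>> import org.apache.spark.api.java.JavaSparkContext;
>>>> import org.apache.spark.sql.api.java.JavaSQLContext;
>>>> import org.apache.spark.sql.api.java.JavaSchemaRDD;
>>>> import org.apache.spark.sql.api.java.Row;
>>>> 
>>>> import com.databricks.spark.avro.AvroUtils;
>>>> 
>>>> public class AvroSparkTest {
>>>>     public static void main(String[] args) throws Exception {
>>>>         // Standalone cluster master; use "local[2]" instead to run in local mode.
>>>>         SparkConf sparkConf = new SparkConf()
>>>>                 .setMaster("spark://niranda-ThinkPad-T540p:7077")
>>>>                 .setAppName("avro-spark-test");
>>>> 
>>>>         JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
>>>>         JavaSQLContext sqlContext = new JavaSQLContext(sparkContext);
>>>> 
>>>>         // Load the Avro file as a SchemaRDD and expose it to SQL as a temp table.
>>>>         JavaSchemaRDD episodes = AvroUtils.avroFile(sqlContext,
>>>>                 "/home/niranda/projects/avro-spark-test/src/test/resources/episodes.avro");
>>>>         episodes.printSchema();
>>>>         episodes.registerTempTable("avroTable");
>>>>         List<Row> result = sqlContext.sql("SELECT * FROM avroTable").collect();
>>>> 
>>>>         for (Row row : result) {
>>>>             System.out.println(row.toString());
>>>>         }
>>>>     }
>>>> }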
>>>> 
>>>> As you pointed out, this error occurs when the Hadoop dependency is added. The app runs without a problem when the Hadoop dependency is removed and the master is set to local[].
>>>> 
>>>> Cheers
>>>> 
>>>>> On Tue, Jan 6, 2015 at 3:23 PM, Sean Owen <sowen@cloudera.com> wrote:
>>>>> -dev
>>>>> 
>>>>> Guava was not downgraded to 11. That PR was not merged. It was part of a discussion about, indeed, what to do about potential Guava version conflicts. Spark uses Guava, but so does Hadoop, and so do user programs.
>>>>> 
>>>>> Spark uses 14.0.1 in fact: https://github.com/apache/spark/blob/master/pom.xml#L330
>>>>> 
>>>>> This is a symptom of conflict between Spark's Guava 14 and Hadoop's Guava 11. See for example https://issues.apache.org/jira/browse/HIVE-7387 as well.
>>>>> 
>>>>> Guava is now shaded in Spark as of 1.2.0 (and 1.1.x?), so I would think a lot of these problems are solved. As we've seen though, this one is tricky.
>>>>> 
>>>>> What's your Spark version, and what are you executing? Which mode: standalone or YARN? Which Hadoop version?
>>>>> 
>>>>> 
>>>>>> On Tue, Jan 6, 2015 at 8:38 AM, Niranda Perera <niranda.perera@gmail.com> wrote:
>>>>>> Hi, 
>>>>>> 
>>>>>> I have been running a simple Spark app on a local Spark cluster and I came across this error:
>>>>>> 
>>>>>> Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
>>>>>>     at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
>>>>>>     at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
>>>>>>     at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
>>>>>>     at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
>>>>>>     at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>>>>>>     at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
>>>>>>     at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
>>>>>>     at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
>>>>>>     at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
>>>>>>     at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
>>>>>>     at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
>>>>>>     at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
>>>>>>     at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
>>>>>>     at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:136)
>>>>>>     at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:114)
>>>>>>     at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:787)
>>>>>>     at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
>>>>>>     at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:992)
>>>>>>     at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98)
>>>>>>     at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
>>>>>>     at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>>>>>>     at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
>>>>>>     at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
>>>>>>     at org.apache.spark.SparkContext.broadcast(SparkContext.scala:945)
>>>>>>     at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:695)
>>>>>>     at com.databricks.spark.avro.AvroRelation.buildScan$lzycompute(AvroRelation.scala:45)
>>>>>>     at com.databricks.spark.avro.AvroRelation.buildScan(AvroRelation.scala:44)
>>>>>>     at org.apache.spark.sql.sources.DataSourceStrategy$.apply(DataSourceStrategy.scala:56)
>>>>>>     at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>>>>>>     at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>>>>>>     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>>>>>     at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
>>>>>>     at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
>>>>>>     at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
>>>>>>     at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
>>>>>>     at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
>>>>>>     at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
>>>>>>     at org.apache.spark.sql.api.java.JavaSchemaRDD.collect(JavaSchemaRDD.scala:114)
>>>>>> 
>>>>>> 
>>>>>> While looking into this I found out that Guava was downgraded to version 11 in this PR:
>>>>>> https://github.com/apache/spark/pull/1610
>>>>>> 
>>>>>> In this PR, the hashInt call at OpenHashSet.scala:261 has been changed to hashLong.
>>>>>> But when I actually run my app, the "java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt" error occurs,
>>>>>> which is understandable because hashInt is not available before Guava 12.
>>>>>> 
>>>>>> So I'm wondering why this occurs.
>>>>>> 
>>>>>> Cheers
>>>>>> -- 
>>>>>> Niranda Perera
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Niranda
>> 
>> 
>> 
>> -- 
>> Niranda
> 
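
Editorial note: the NoSuchMethodError quoted above is what one would expect when an older Guava (for example the Guava 11 pulled in by Hadoop) is found on the classpath ahead of the version Spark was compiled against. The following is a minimal diagnostic sketch, not something posted in the thread. The class name ClasspathProbe and its two helper methods are hypothetical; it uses only standard JDK reflection and assumes Java 7 or later for the multi-catch syntax. It reports which jar a class was actually loaded from and whether a given method exists on it.

```java
import java.security.CodeSource;

// Hypothetical helper for diagnosing dependency version collisions.
public class ClasspathProbe {

    /** Returns the jar or directory a class was loaded from, or a marker string if unavailable. */
    public static String locationOf(String className) {
        try {
            // Load without initializing the class; we only want its origin.
            Class<?> cls = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "(bootstrap or unknown)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    /** Returns true if the named class is loadable and declares or inherits the given public method. */
    public static boolean hasMethod(String className, String methodName, Class<?>... paramTypes) {
        try {
            Class<?> cls = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            cls.getMethod(methodName, paramTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class and method named in the stack trace from the thread.
        String guava = "com.google.common.hash.HashFunction";
        System.out.println("HashFunction loaded from: " + locationOf(guava));
        System.out.println("hashInt(int) present:     " + hasMethod(guava, "hashInt", int.class));
    }
}
```

Running main with the same classpath as the failing application (for instance from the same IDE run configuration) should show whether the Guava jar that wins is the older one. If hashInt(int) is reported as absent, marking Hadoop and Spark as 'provided' dependencies, as suggested earlier in the thread, is the first thing to try.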
