
From Lukas Nalezenec <lukas.naleze...@firma.seznam.cz>
Subject strange StreamCorruptedException
Date Thu, 17 Apr 2014 14:39:23 GMT
Hi all,

I am running an algorithm similar to wordcount, and I am not sure why it
fails at the end. There are only about 200 distinct words, so the result
of the computation should be small.

I have a SIMR command line with Spark 0.8.1 and 50 workers, each with
~512 MB of RAM.
The dataset is a 100 GB tab-separated text HadoopRDD with ~6000 partitions.

My command line is:

dataset.map(x => x.split("\t"))
  .map(x => (x(2), x(3).toInt))
  .reduceByKey(_ + _)
  .collect
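For reference, the same job spelled out step by step, as I would run it
in the shell (the `sc` handle and the input path below are placeholders,
not my exact setup):

// A sketch of the job above; `sc` is the SparkContext provided by the
// SIMR shell, and the HDFS path is a placeholder.
val dataset = sc.textFile("hdfs:///path/to/dataset") // ~100 GB, ~6000 partitions

val counts = dataset
  .map(line => line.split("\t"))
  .map(fields => (fields(2), fields(3).toInt)) // key = 3rd column, value = 4th column as Int
  .reduceByKey(_ + _)                          // sum values per key; ~200 distinct keys expected
  .collect()                                   // reduced result is tiny, so collect() should be safe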

It throws this exception:

java.io.StreamCorruptedException: invalid type code: AC
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
    at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:101)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:26)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:53)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:95)

What am I doing wrong?

Thanks!
Best Regards
Lukas
