spark-user mailing list archives

From Rok Roskar <rokros...@gmail.com>
Subject Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile
Date Wed, 28 Jan 2015 09:30:43 GMT
hi, thanks for the quick answer -- I suppose this is possible, though I
don't understand how it could come about. The largest individual RDD
elements are ~1 MB in size (most are smaller) and the RDD is composed of
800k of them. The file is saved in 134 parts, but is being read back in
using some 1916+ partitions (I don't know why, actually -- how does this
number come about?). How can I check whether any objects/batches exceed 2 GB?
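
A minimal sketch of how one might check this from PySpark -- assuming here
that rdd is the original RDD that was passed to saveAsPickleFile, and that
the path below is just a placeholder:

    import pickle

    # Pickle each element individually and look at the largest serialized
    # sizes (in bytes); a single record near 2**31 bytes (~2 GB) would
    # explain an integer overflow. Note that saveAsPickleFile groups
    # elements into batches (batchSize=10 by default) before pickling, so a
    # written record is roughly batchSize consecutive elements.
    sizes = rdd.map(lambda x: len(pickle.dumps(x, protocol=2)))
    print(sizes.top(10))   # ten largest per-element sizes, in bytes

    # The partition count seen when reading comes from the Hadoop input
    # splits (block-sized chunks of the part files), not from the number of
    # part files, which is typically why 134 parts can turn into ~1916
    # read partitions.
    rdd2 = sc.pickleFile("hdfs:///path/to/saved_rdd")
    print(rdd2.getNumPartitions())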

Thanks,

Rok


On Tue, Jan 27, 2015 at 7:55 PM, Davies Liu <davies@databricks.com> wrote:

> Maybe it's caused by an integer overflow -- is it possible that one object
> or batch is bigger than 2 GB (after pickling)?
>
> On Tue, Jan 27, 2015 at 7:59 AM, rok <rokroskar@gmail.com> wrote:
> > I've got a dataset saved with saveAsPickleFile using pyspark -- it saves
> > without problems. When I try to read it back in, it fails with:
> >
> > Job aborted due to stage failure: Task 401 in stage 0.0 failed 4 times,
> > most recent failure: Lost task 401.3 in stage 0.0 (TID 449,
> > e1326.hpc-lca.ethz.ch): java.lang.NegativeArraySizeException:
> >         org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:119)
> >         org.apache.hadoop.io.BytesWritable.setSize(BytesWritable.java:98)
> >         org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:153)
> >         org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> >         org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> >         org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1875)
> >         org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1848)
> >         org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
> >         org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
> >         org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:219)
> >         org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:188)
> >         org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> >         org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> >         scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> >         org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:330)
> >         org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
> >         org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
> >         org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
> >         org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
> >         org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
> >
> >
> > Not really sure where to start looking for the culprit -- any suggestions
> > most welcome. Thanks!
> >
> > Rok
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/NegativeArraySizeException-in-pyspark-when-loading-an-RDD-pickleFile-tp21395.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> > For additional commands, e-mail: user-help@spark.apache.org
> >
>
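
If a single pickled batch really does exceed 2 GB, as suggested above, one
way to test (and work around) that hypothesis is to re-save the data with a
smaller batch size -- a minimal sketch, with placeholder paths:

    # batchSize=1 pickles every element on its own instead of the default
    # batches of 10, keeping each written record small.
    rdd.saveAsPickleFile("hdfs:///path/to/saved_rdd_small_batches", batchSize=1)

    # minPartitions only sets a lower bound on the number of input splits
    # used when reading back.
    rdd2 = sc.pickleFile("hdfs:///path/to/saved_rdd_small_batches",
                         minPartitions=134)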
