spark-user mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit
Date Wed, 16 Nov 2016 17:32:25 GMT
That seems a bit odd. Could we open an issue and discuss it in the
repository I linked?

Let me try to reproduce your case with your data if possible.

On 17 Nov 2016 2:26 a.m., "Arun Patel" <arunp.bigdata@gmail.com> wrote:

> I tried the options below.
>
> 1) Increased executor memory, up to the maximum possible, 14 GB.  Same
> error.
> 2) Tried the new version, spark-xml_2.10:0.4.1.  Same error.
> 3) Tried lower-level rowTags.  A lower-level rowTag worked and returned
> 16,000 rows.
>
> Are there any workarounds for this issue?  I tried playing with spark.memory.fraction
> and spark.memory.storageFraction, but they did not help.  Appreciate your
> help on this!
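[Editor's note: a hedged sketch of the tuning attempt described above. The jar path and sizes are placeholders taken from elsewhere in this thread, and note that spark.memory.fraction and spark.memory.storageFraction only re-divide the existing heap between execution and storage; if a single record cannot fit at all, only a larger heap (or a lower-level rowTag) can help.]

```shell
# Hypothetical local-mode launch: the two fractions re-partition the heap,
# but --driver-memory (which holds the executor in local mode) sets its size.
pyspark --master local[4] \
  --driver-memory 12G \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.5 \
  --jars /tmp/spark-xml_2.10-0.4.1.jar
```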
>
>
>
> On Tue, Nov 15, 2016 at 8:44 PM, Arun Patel <arunp.bigdata@gmail.com>
> wrote:
>
>> Thanks for the quick response.
>>
>> It's a single XML file and I am using a top-level rowTag, so it creates
>> only one row in a Dataframe, with 5 columns. One of these columns, a
>> StructType, holds most of the data.  Is there a limit on how much data a
>> single cell of a Dataframe can hold?
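[Editor's note: there is a hard JVM-level ceiling lurking behind this question. A single Java array cannot exceed roughly Integer.MAX_VALUE elements, so any one record that spark-xml buffers into a byte array must stay under about 2 GB no matter how much executor memory is configured. A rough sketch of that arithmetic (the exact headroom below Integer.MAX_VALUE varies by JVM; the 8-element margin here is a common assumption):]

```python
# Sketch of the JVM single-array ceiling behind
# "OutOfMemoryError: Requested array size exceeds VM limit".
# The exact safe margin below Integer.MAX_VALUE is JVM-specific.
JVM_MAX_ARRAY_BYTES = 2**31 - 1 - 8  # ~2 GB

def record_fits_in_one_array(record_bytes):
    """True if a single buffered record can live in one JVM byte[]."""
    return record_bytes <= JVM_MAX_ARRAY_BYTES

print(record_fits_in_one_array(500 * 1024**2))  # True: a 500 MB record fits
print(record_fits_in_one_array(3 * 1024**3))    # False: a 3 GB record cannot
```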
>>
>> I will check with the new version, try different rowTags, and increase
>> executor-memory tomorrow. I will open a new issue as well.
>>
>>
>>
>> On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon <gurwls223@gmail.com>
>> wrote:
>>
>>> Hi Arun,
>>>
>>>
>>> I have a few questions.
>>>
>>> Does your XML file contain a few huge documents? If a single row is very
>>> large (say, 500 MB), it would consume a lot of memory, because, if I
>>> remember correctly, the reader has to hold at least one whole row in
>>> memory to iterate. I remember this happening to me before while
>>> processing a huge record for test purposes.
>>>
>>>
>>> How about trying to increase --executor-memory?
>>>
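[Editor's note: a hypothetical invocation of the suggestion above. In yarn mode the record is buffered on the executor, so --executor-memory, rather than --driver-memory, is the knob to raise; the memory size and jar path are placeholders.]

```shell
# Hypothetical YARN launch with a larger executor heap.
pyspark --master yarn \
  --executor-memory 14G \
  --jars /tmp/spark-xml_2.10-0.4.1.jar
```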
>>>
>>> Also, could you try selecting only a few fields with the latest version,
>>> to prune the data and be doubly sure, if you don't mind?
>>>
>>>
>>> Lastly, do you mind if I ask to open an issue in
>>> https://github.com/databricks/spark-xml/issues if you still face this
>>> problem?
>>>
>>> I will do my best to take a look.
>>>
>>>
>>> Thank you.
>>>
>>>
>>> 2016-11-16 9:12 GMT+09:00 Arun Patel <arunp.bigdata@gmail.com>:
>>>
>>>> I am trying to read an XML file which is 1 GB in size.  I am getting an
>>>> error 'java.lang.OutOfMemoryError: Requested array size exceeds VM
>>>> limit' after reading 7 partitions in local mode.  In Yarn mode, it
>>>> throws 'java.lang.OutOfMemoryError: Java heap space' error after
>>>> reading 3 partitions.
>>>>
>>>> Any suggestion?
>>>>
>>>> PySpark Shell Command:    pyspark --master local[4] --driver-memory 3G
>>>> --jars /tmp/spark-xml_2.10-0.3.3.jar
>>>>
>>>>
>>>>
>>>> Dataframe Creation Command:   df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>>>>
>>>>
>>>>
>>>> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0
>>>> (TID 1) in 25978 ms on localhost (1/10)
>>>>
>>>> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split:
>>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
>>>>
>>>> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID
>>>> 2). 2309 bytes result sent to driver
>>>>
>>>> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0
>>>> (TID 3, localhost, partition 3,ANY, 2266 bytes)
>>>>
>>>> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
>>>>
>>>> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0
>>>> (TID 2) in 51001 ms on localhost (2/10)
>>>>
>>>> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split:
>>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
>>>>
>>>> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID
>>>> 3). 2309 bytes result sent to driver
>>>>
>>>> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0
>>>> (TID 4, localhost, partition 4,ANY, 2266 bytes)
>>>>
>>>> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
>>>>
>>>> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0
>>>> (TID 3) in 24336 ms on localhost (3/10)
>>>>
>>>> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split:
>>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
>>>>
>>>> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID
>>>> 4). 2309 bytes result sent to driver
>>>>
>>>> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0
>>>> (TID 5, localhost, partition 5,ANY, 2266 bytes)
>>>>
>>>> 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
>>>>
>>>> 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0
>>>> (TID 4) in 20895 ms on localhost (4/10)
>>>>
>>>> 16/11/15 18:28:40 INFO NewHadoopRDD: Input split:
>>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
>>>>
>>>> 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID
>>>> 5). 2309 bytes result sent to driver
>>>>
>>>> 16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0
>>>> (TID 6, localhost, partition 6,ANY, 2266 bytes)
>>>>
>>>> 16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
>>>>
>>>> 16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0
>>>> (TID 5) in 20793 ms on localhost (5/10)
>>>>
>>>> 16/11/15 18:29:01 INFO NewHadoopRDD: Input split:
>>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728
>>>>
>>>> 16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID
>>>> 6). 2309 bytes result sent to driver
>>>>
>>>> 16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0
>>>> (TID 7, localhost, partition 7,ANY, 2266 bytes)
>>>>
>>>> 16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
>>>>
>>>> 16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0
>>>> (TID 6) in 21306 ms on localhost (6/10)
>>>>
>>>> 16/11/15 18:29:22 INFO NewHadoopRDD: Input split:
>>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728
>>>>
>>>> 16/11/15 18:29:43 INFO Executor: Finished task 7.0 in stage 0.0 (TID
>>>> 7). 2309 bytes result sent to driver
>>>>
>>>> 16/11/15 18:29:43 INFO TaskSetManager: Starting task 8.0 in stage 0.0
>>>> (TID 8, localhost, partition 8,ANY, 2266 bytes)
>>>>
>>>> 16/11/15 18:29:43 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
>>>>
>>>> 16/11/15 18:29:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0
>>>> (TID 7) in 21130 ms on localhost (7/10)
>>>>
>>>> 16/11/15 18:29:43 INFO NewHadoopRDD: Input split:
>>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:1073741824+134217728
>>>>
>>>> 16/11/15 18:29:48 ERROR Executor: Exception in task 0.0 in stage 0.0
>>>> (TID 0)
>>>>
>>>> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>>>>
>>>>         at java.util.Arrays.copyOf(Arrays.java:2271)
>>>>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>>>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>>>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:122)
>>>>         at java.io.DataOutputStream.write(DataOutputStream.java:88)
>>>>         at com.databricks.spark.xml.XmlRecordReader.readUntilMatch(XmlInputFormat.scala:188)
>>>>         at com.databricks.spark.xml.XmlRecordReader.next(XmlInputFormat.scala:156)
>>>>         at com.databricks.spark.xml.XmlRecordReader.nextKeyValue(XmlInputFormat.scala:141)
>>>>         at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
>>>>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>>>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>>         at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>>>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>>         at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>>>>         at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>>>>         at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>>>>         at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>>>>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>>>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>>
>>>> 16/11/15 18:29:48 ERROR SparkUncaughtExceptionHandler: Uncaught
>>>> exception in thread Thread[Executor task launch worker-0,5,main]
>>>>
>>>> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>>>>
>>>>
>>>>
>>>
>>
>
