spark-user mailing list archives

From 李明伟 <kramer2...@126.com>
Subject Re:Re: Why very small work load cause GC overhead limit?
Date Wed, 20 Apr 2016 03:08:03 GMT
The memory parameters are: --executor-memory 8G --driver-memory 4G. Please note that the data
size is very small. The total size of the data is less than 10 MB.

As for jmap, that is a little hard for me because I am not a Java developer. I will google
jmap first. Thanks.
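
A minimal sketch of how a jmap class histogram might be captured from Python, assuming the JDK's jmap tool is on the PATH and the process id of the driver JVM is already known (for example from jps). The helper names jmap_histo_cmd and capture_histogram are made up for illustration and are not part of Spark.

```python
import subprocess


def jmap_histo_cmd(pid, live_only=True):
    """Build the jmap command that prints a class histogram for a JVM.

    With live_only=True the ":live" suffix asks jmap to count only
    objects that are still reachable.
    """
    option = "-histo:live" if live_only else "-histo"
    return ["jmap", option, str(pid)]


def capture_histogram(pid, out_path):
    """Run jmap against the given pid and save the histogram text to out_path.

    Assumes jmap is installed and the caller may attach to the process.
    """
    with open(out_path, "w") as out:
        subprocess.check_call(jmap_histo_cmd(pid), stdout=out)
```

Calling capture_histogram(12345, "driver_histo.txt") shortly before the error appears would write the histogram of process 12345 to a text file; 12345 is a placeholder pid.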


Regards
Mingwei







At 2016-04-20 11:03:20, "Ted Yu" <yuzhihong@gmail.com> wrote:
>Can you tell us the memory parameters you used ?
>
>If you can capture jmap output before the GC limit is exceeded, that would give us more clues.

>
>Thanks
>
>> On Apr 19, 2016, at 7:40 PM, "kramer2009@126.com" <kramer2009@126.com> wrote:
>> 
>> Hi all,
>> 
>> I am using Spark to do some calculations.
>> The situation is:
>> 1. New files arrive in a folder periodically.
>> 2. I turn each new file into a data frame and append it to the previous
>> data frame.
>> 
>> The code looks like this:
>> 
>> 
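>>    # Imports the snippet relies on (assumed; sc and sqlContext come from the pyspark shell)
>>    from hdfs import InsecureClient
>>    from pyspark.sql import Row
>>    import pyspark.sql.functions as func
>> 
>>    # Get the file list in the HDFS directory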
>>    client = InsecureClient('http://10.79.148.184:50070')
>>    file_list = client.list('/test')
>> 
>>    df_total = None
>>    counter = 0
>>    for file in file_list:
>>        counter += 1
>> 
>>        # turn each file (CSV format) into data frame
>>        lines = sc.textFile("/test/%s" % file)
>>        parts = lines.map(lambda l: l.split(","))
>>        rows = parts.map(lambda p: Row(router=p[0], interface=int(p[1]),
>>                                       protocol=p[7], bit=int(p[10])))
>>        df = sqlContext.createDataFrame(rows)
>> 
>>        # do some transform on the data frame
>>        df_protocol = df.groupBy(['protocol']).agg(func.sum('bit').alias('bit'))
>> 
>>        # add the current data frame to previous data frame set
>>        if df_total is None:
>>            df_total = df_protocol
>>        else:
>>            df_total = df_total.unionAll(df_protocol)
>> 
>>        # cache the df_total
>>        df_total.cache()
>>        if counter % 5 == 0:
>>            df_total.rdd.checkpoint()
>> 
>>        # get the df_total information
>>        df_total.show()
>> 
>> 
>> I know that as time goes on, df_total could become big. But well before
>> that happens, the above code already raises an exception.
>> 
>> After about 30 iterations of the loop, the code throws a "GC overhead limit
>> exceeded" exception. The files are very small, so even after 300 iterations
>> the data size would only be a few MB. I do not know why it throws a GC error.
>> 
>> The exception details are below:
>> 
>>    16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread task-result-getter-2
>>    java.lang.OutOfMemoryError: GC overhead limit exceeded
>>        at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
>>        at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
>>        at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>>        at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>        at java.lang.reflect.Method.invoke(Method.java:606)
>>        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>        at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
>>        at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
>>        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>>        at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>>        at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>        at java.lang.reflect.Method.invoke(Method.java:606)
>>        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>        at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
>>        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>>        at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
>>        at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>>        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>>        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>>        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>>    Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>        at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
>>        at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
>>        at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>>        at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>        at java.lang.reflect.Method.invoke(Method.java:606)
>>        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>        at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
>>        at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
>>        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>>        at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>>        at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>        at java.lang.reflect.Method.invoke(Method.java:606)
>>        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>        at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
>>        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>>        at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
>>        at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>>        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>>        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>>        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>> 
>> 
>> 
>> 
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-very-small-work-load-cause-GC-overhead-limit-tp26803.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.