spark-user mailing list archives

From Yiannis Gkoufas <>
Subject DataFrame operation on parquet: GC overhead limit exceeded
Date Wed, 18 Mar 2015 13:15:08 GMT
Hi there,

I was trying the new DataFrame API with some basic operations on a parquet
file. I have 7 nodes, each with 12 cores and 8GB RAM allocated to the
worker, in standalone cluster mode.
The code is the following:

val people = sqlContext.parquetFile("/data.parquet");
val res =

The dataset consists of 16 billion entries.
The error I get is java.lang.OutOfMemoryError: GC overhead limit exceeded

My configuration is:

spark.serializer    org.apache.spark.serializer.KryoSerializer
spark.driver.memory    6g
spark.executor.extraJavaOptions -XX:+UseCompressedOops
spark.shuffle.manager    sort
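
I notice I have not set spark.executor.memory anywhere, so the executors
are presumably running with the default heap rather than the 8GB available
per worker. Should I be setting it explicitly, e.g. something like the
following (the 6g value is just a guess on my part)?

spark.executor.memory    6g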

Any idea how I can work around this?

Thanks a lot
