spark-user mailing list archives

From Yiannis Gkoufas <johngou...@gmail.com>
Subject DataFrame operation on parquet: GC overhead limit exceeded
Date Wed, 18 Mar 2015 13:15:08 GMT
Hi there,

I was trying the new DataFrame API with some basic operations on a parquet
dataset.
The cluster runs in standalone mode on 7 nodes with 12 cores each, and each
worker is allocated 8 GB of RAM.
The code is the following:

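import org.apache.spark.sql.functions.sum

val people = sqlContext.parquetFile("/data.parquet")
// Total power and supply per (name, date); fetch only the first 10 rows.
val res = people.groupBy("name", "date").agg(sum("power"), sum("supply")).take(10)
// take(10) returns an Array[Row]; print each row rather than the array reference.
res.foreach(println)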

The dataset consists of 16 billion entries.
The job fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.

My configuration is:

spark.serializer    org.apache.spark.serializer.KryoSerializer
spark.driver.memory    6g
spark.executor.extraJavaOptions -XX:+UseCompressedOops
spark.shuffle.manager    sort
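
For reference, the same settings could also be applied programmatically. This is only a minimal sketch, assuming the job is launched through spark-submit (which supplies the master URL); the object name and application name are hypothetical and not part of my actual setup.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical wrapper object, for illustration only.
object ParquetAggConfig {
  def main(args: Array[String]): Unit = {
    // Same keys and values as the properties listed above.
    val conf = new SparkConf()
      .setAppName("parquet-agg") // hypothetical application name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
      .set("spark.shuffle.manager", "sort")
    // spark.driver.memory (6g) is deliberately not set here: in client mode
    // the driver JVM is already running when SparkConf is read, so that
    // value stays in the properties file or on the command line.

    val sc = new SparkContext(conf)

    // Print the settings the context actually picked up, sorted by key.
    sc.getConf.getAll.sorted.foreach { case (k, v) => println(s"$k=$v") }

    sc.stop()
  }
}
```

Printing the effective configuration this way is one option for checking that the properties file and the running application agree.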

Any idea how I can work around this?

Thanks a lot
