spark-user mailing list archives

From Ted Yu <>
Subject Re: OutOfMemoryError when using DataFrame created by Spark SQL
Date Wed, 25 Mar 2015 09:37:01 GMT
Can you try giving the Spark driver more heap?
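
For reference, the driver heap has to be set before the driver JVM starts, so it goes on the launch command or in spark-defaults.conf rather than in a SparkConf from inside the shell. A minimal sketch (the 2g value is only illustrative):

    # at launch time:
    ./bin/spark-shell --driver-memory 2g

    # or persistently, in conf/spark-defaults.conf:
    spark.driver.memory  2g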


> On Mar 25, 2015, at 2:14 AM, Todd Leo <> wrote:
> Hi,
> I am using Spark SQL to query my Hive cluster, following the Spark SQL and DataFrame Guide
> step by step. However, my HiveQL via sqlContext.sql() fails and a java.lang.OutOfMemoryError
> is raised. The result of this query is expected to be small, since it includes a limit 1000
> clause. My code is shown below:
> scala> import sqlContext.implicits._                                             
> scala> val df = sqlContext.sql("""select * from some_table where logdate="2015-03-24"
> limit 1000""")
> and the error msg:
> [ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27] [ActorSystem(sparkDriver)]
> Uncaught fatal error from thread [sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver]
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> The master heap is set with -Xms512m -Xmx512m, while the workers are set with -Xms4096M
> -Xmx4096M, which I presume is sufficient for this trivial query.
> Additionally, after restarting the spark-shell and re-running the limit 5 query, the df object
> is returned and can be printed, but other APIs fail with OutOfMemoryError, namely
> df.count(),"some_field").show(), and so forth.
> I understand that an RDD can be collected to the master so that further transformations can
> be applied. Since DataFrame has “richer optimizations under the hood” and offers familiar
> conventions to an R/Julia user, I really hope this error can be tackled, and that DataFrame
> is robust enough to depend on.
> Thanks in advance!
> Todd
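
Putting the fragments above together, the failing sequence looks roughly like this in
spark-shell (a sketch only: some_table, logdate and some_field are the placeholder names
from the mail, and the groupBy line adds the count() aggregate that GroupedData needs
before it can be shown):

    // assumes a Spark 1.3.x spark-shell built with Hive support, e.g. launched with
    //   ./bin/spark-shell --driver-memory 2g
    import sqlContext.implicits._

    val df = sqlContext.sql(
      """select * from some_table where logdate="2015-03-24" limit 1000""")

    df.count()                               // reported to throw OutOfMemoryError"some_field").count().show()  // likewise; GroupedData needs an
                                             // aggregate such as count() before show()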