Thanks, I noticed this after posting. I'll try that.
I also suspect that Clojure creates more classes than the equivalent Java would, so I'll nudge the size a bit higher.

On 9 January 2015 at 11:45, Sean Owen <sowen@cloudera.com> wrote:
It's normal for PermGen to be a bit more of an issue with Spark than
for other JVM-based applications. You should simply increase the
PermGen size, which I don't see in your command. -XX:MaxPermSize=256m
allows it to grow to 256m for example. The right size depends on your
total heap size and app.
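
As a rough sketch, the command from the original message with the PermGen cap added to the executor options might look like the following. The 256m value is only the example figure mentioned above, not a tuned recommendation, and the PERM_SIZE and EXECUTOR_OPTS variable names are made up for illustration. The script prints the arguments as a dry run rather than launching anything.

```shell
#!/bin/sh
# Sketch only: 256m is the example value from this thread, not a tuned setting.
PERM_SIZE="256m"

# Keep the class-unloading flags from the original command and add the PermGen cap.
EXECUTOR_OPTS="-XX:MaxPermSize=${PERM_SIZE} -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled"

# Dry run: print the arguments instead of submitting. Remove the leading "echo" to submit.
echo spark/bin/spark-submit \
  --class mything.core \
  --name "My Thing" \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf "spark.executor.extraJavaOptions=${EXECUTOR_OPTS}" \
  /root/spark/code/myjar.jar
```

Quoting the whole spark.executor.extraJavaOptions=... pair keeps the space-separated JVM flags together as a single --conf value.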

Also, Java 8 no longer has a permanent generation, so this particular
type of problem goes away and the tuning is not needed. You might
consider running on Java 8.

On Fri, Jan 9, 2015 at 10:38 AM, Joe Wass <jwass@crossref.org> wrote:
> I'm running on an AWS cluster of 10 x m1.large (64 bit, 7.5 GiB RAM). FWIW
> I'm using the Flambo Clojure wrapper which uses the Java API but I don't
> think that should make any difference. I'm running with the following
> command:
>
> spark/bin/spark-submit --class mything.core --name "My Thing" --conf
> spark.yarn.executor.memoryOverhead=4096 --conf
> spark.executor.extraJavaOptions="-XX:+CMSClassUnloadingEnabled
> -XX:+CMSPermGenSweepingEnabled" /root/spark/code/myjar.jar
>
> For one of the stages I'm getting errors:
>
>  - ExecutorLostFailure (executor lost)
>  - Resubmitted (resubmitted due to lost executor)
>
> And I think they're caused by slave executor JVMs dying with this error:
>
> java.lang.OutOfMemoryError: PermGen space
>         java.lang.Class.getDeclaredConstructors0(Native Method)
>         java.lang.Class.privateGetDeclaredConstructors(Class.java:2585)
>         java.lang.Class.getConstructor0(Class.java:2885)
>         java.lang.Class.newInstance(Class.java:350)
>         sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:399)
>         sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:396)
>         java.security.AccessController.doPrivileged(Native Method)
>         sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:395)
>         sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:113)
>         sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:331)
>         java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
>         java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
>         java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
>         java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
>         java.security.AccessController.doPrivileged(Native Method)
>         java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
>         java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
>         java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
>         java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
>         java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>         java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
>         java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>         java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>         java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>         java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>         java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>         java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>         java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>         java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>         java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>         java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>         java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>
> 1 stage out of 14 (so far) is failing. My failing stage is 1768 succeeded /
> 1862 (940 failed). 7 tasks failed with OOM, 919 were "Resubmitted
> (resubmitted due to lost executor)".
>
> Now my "Aggregated Metrics by Executor" shows that 10 out of 16 executors
> show "CANNOT FIND ADDRESS" which I imagine means the JVM blew up and hasn't
> been restarted. Now the 'Executors' tab shows only 7 executors.
>
>  - Is this normal?
>  - Any ideas why this is happening?
>  - Any other measures I can take to prevent this?
>  - Is the rest of my app going to run on a reduced number of executors?
>  - Can I re-start the executors mid-application? This is a long-running job,
> so I'd like to do what I can whilst it's running, if possible.
>  - Am I correct in thinking that the --conf arguments are supplied to the
> JVMs of the slave executors, so they will be receiving the extraJavaOptions
> and memoryOverhead?
>
> Thanks very much!
>
> Joe