Hello all,

I was running a Spark job and some executors failed without any error information. The dead executors were replaced by newly requested ones, but the Spark web UI showed no failures. Normally, if it is a memory issue, I can find an OOM error there, but not this time.

Configuration:
1. each executor has 25G memory
2. garbage collection: G1GC

From the garbage collector output (please see below), it looks like the executor still has free heap space, so I'm not sure what I should improve.

One thing I can try is to configure Metaspace, e.g. -XX:MaxMetaspaceSize=256m. Any other suggestions? Thanks a lot.
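
To be concrete, this is roughly how I would pass that setting to the executors. It is only a sketch: the helper names build_executor_conf and to_submit_args are made up for illustration, and I am assuming the JVM flags go through spark.executor.extraJavaOptions together with the G1GC flag I already use.

```python
import shlex


def build_executor_conf(memory="25g", max_metaspace="256m"):
    """Return Spark conf entries for the executor JVM (hypothetical helper)."""
    java_opts = [
        "-XX:+UseG1GC",                           # collector already in use
        f"-XX:MaxMetaspaceSize={max_metaspace}",  # the setting I want to try
    ]
    return {
        "spark.executor.memory": memory,
        "spark.executor.extraJavaOptions": " ".join(java_opts),
    }


def to_submit_args(conf):
    """Turn a conf dict into a flat list of spark-submit '--conf k=v' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # shlex.join quotes the value that contains a space, so the line can be pasted into a shell.
    print(shlex.join(to_submit_args(build_executor_conf())))
```

The printed line would be appended to my usual spark-submit command.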

Best

Xuan

4731.701: [GC pause (G1 Evacuation Pause) (young), 0.0965042 secs]
   [Parallel Time: 85.5 ms, GC Workers: 28]
      [GC Worker Start (ms): Min: 4731702.0, Avg: 4731722.0, Max: 4731746.6, Diff: 44.6]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.2, Max: 1.9, Diff: 1.8, Sum: 6.4]
      [Update RS (ms): Min: 0.0, Avg: 22.1, Max: 40.8, Diff: 40.8, Sum: 617.7]
         [Processed Buffers: Min: 0, Avg: 58.5, Max: 195, Diff: 195, Sum: 1639]
      [Scan RS (ms): Min: 0.1, Avg: 1.2, Max: 1.9, Diff: 1.8, Sum: 34.1]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 1.3]
      [Object Copy (ms): Min: 36.4, Avg: 38.0, Max: 41.6, Diff: 5.1, Sum: 1065.0]
      [Termination (ms): Min: 0.0, Avg: 3.3, Max: 3.6, Diff: 3.6, Sum: 91.7]
         [Termination Attempts: Min: 1, Avg: 29.9, Max: 45, Diff: 44, Sum: 837]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.3, Max: 0.6, Diff: 0.5, Sum: 8.1]
      [GC Worker Total (ms): Min: 40.4, Avg: 65.1, Max: 85.4, Diff: 45.0, Sum: 1824.2]
      [GC Worker End (ms): Min: 4731786.9, Avg: 4731787.2, Max: 4731787.4, Diff: 0.5]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 1.5 ms]
   [Other: 9.4 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.0 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.6 ms]
      [Humongous Register: 0.6 ms]
      [Humongous Reclaim: 1.1 ms]
      [Free CSet: 5.0 ms]
   [Eden: 8280.0M(1136.0M)->0.0B(14.9G) Survivors: 144.0M->64.0M Heap: 10.5G(25.0G)->2021.9M(25.0G)]
 [Times: user=1.46 sys=0.01, real=0.09 secs] 
Heap
 garbage-first heap   total 26214400K, used 11822248K [0x0000000180000000, 0x000000018040c800, 0x00000007c0000000)
  region size 4096K, 2325 young (9523200K), 16 survivors (65536K)
 Metaspace       used 76023K, capacity 77317K, committed 77616K, reserved 1118208K
  class space    used 8649K, capacity 8958K, committed 9008K, reserved 1048576K
 Concurrent marking:
      0   init marks: total time =     0.00 s (avg =     0.00 ms).
     10      remarks: total time =     0.70 s (avg =    70.35 ms).
           [std. dev =    51.84 ms, max =   159.84 ms]
        10  final marks: total time =     0.03 s (avg =     3.43 ms).
              [std. dev =     2.84 ms, max =     9.66 ms]
        10    weak refs: total time =     0.67 s (avg =    66.92 ms).
              [std. dev =    52.19 ms, max =   157.75 ms]
     10     cleanups: total time =     0.31 s (avg =    30.59 ms).
           [std. dev =    24.03 ms, max =    71.07 ms]
    Final counting total time =     0.05 s (avg =     5.13 ms).
    RS scrub total time =     0.18 s (avg =    18.46 ms).
  Total stop_world time =     1.01 s.
  Total concurrent time =    42.94 s (   42.06 s marking).
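
For reference, this is how I read the heap occupancy from the "Heap:" entry of the pause summary above. It is a small sketch that assumes the log keeps this exact format; the function names are my own. For the line above it gives roughly 42% of the 25G heap used before the pause and about 8% after it.

```python
import re

# Size suffixes used in the G1 log, converted to bytes.
UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

# Matches e.g. "Heap: 10.5G(25.0G)->2021.9M(25.0G)".
HEAP_RE = re.compile(
    r"Heap: ([\d.]+[BKMG])\(([\d.]+[BKMG])\)->([\d.]+[BKMG])\(([\d.]+[BKMG])\)"
)


def to_bytes(token):
    """Convert a size token such as '10.5G' or '2021.9M' to bytes."""
    return float(token[:-1]) * UNITS[token[-1]]


def heap_occupancy(line):
    """Return (used/capacity before GC, used/capacity after GC), or None if no match."""
    match = HEAP_RE.search(line)
    if match is None:
        return None
    before, cap_before, after, cap_after = (to_bytes(g) for g in match.groups())
    return before / cap_before, after / cap_after


line = ("[Eden: 8280.0M(1136.0M)->0.0B(14.9G) Survivors: 144.0M->64.0M "
        "Heap: 10.5G(25.0G)->2021.9M(25.0G)]")
before, after = heap_occupancy(line)
print(f"heap used before GC: {before:.0%}, after GC: {after:.0%}")
```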