I have ~300 Spark jobs on Kubernetes (GKE) using the cluster auto-scaler, and sometimes while running these jobs a pretty bad thing happens: the driver (in cluster mode) gets scheduled on Kubernetes and launches many executor pods.
So far so good, but the k8s "Service" associated with the driver does not seem to be propagated in terms of DNS resolution, so all the executors fail with a "spark-application-......cluster.svc.local" does not exist error.
With all executors failing, the driver should fail too, but instead it considers this a "pending" initial allocation and stays stuck forever in a loop of "Initial job has not accepted any resources, please check Cluster UI".
Has anyone else observed this kind of behaviour?
We had it on 2.3.1, and I upgraded to 2.4.1, but the issue still seems to exist even after the "big refactoring" of the Kubernetes cluster scheduler backend.
I can work on a fix / workaround, but I'd like to check with you on the proper way forward:
- Some processes (like the Airflow Helm recipe) rely on a "sleep 30s" before launching the dependent pods (that could be added to /opt/entrypoint.sh used in the Kubernetes packaging)
- We could add a simple step to the init container that attempts the DNS resolution and fails after 60s if it has not succeeded
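The second option could be sketched roughly as follows; this is a minimal sketch, not what Spark ships today. The `wait_for_dns` helper name is made up, and the driver service hostname would have to be passed in (here via a hypothetical argument); I use `getent hosts` for the lookup since it is available in most base images:

```shell
#!/bin/sh
# wait_for_dns <hostname> [timeout_seconds]
# Polls DNS resolution of <hostname> every 2s and fails once
# the timeout (default 60s) is exceeded, so the executor dies
# with a clear error instead of hanging on an unresolvable name.
wait_for_dns() {
  host="$1"
  timeout="${2:-60}"
  elapsed=0
  until getent hosts "$host" > /dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "DNS resolution of $host failed after ${timeout}s" >&2
      return 1
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
  return 0
}
```

Run from an init container (or early in /opt/entrypoint.sh) before the executor starts, e.g. `wait_for_dns "$SPARK_DRIVER_SVC_HOST" 60` (variable name hypothetical), this would at least turn the silent hang into an explicit, retryable pod failure.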
But these steps won't change the fact that the driver will stay stuck, thinking we're still within the initial allocation delay.