spark-dev mailing list archives

From Olivier Girardot <o.girar...@lateral-thoughts.com>
Subject Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails
Date Fri, 03 May 2019 09:36:11 GMT
Hi,
I have not tried another vendor, so I can't say whether it's specific to
GKE, and no, I did not notice anything unusual on the kubelet or kube-dns
processes...

Regards

On Fri, May 3, 2019 at 03:05, Li Gao <ligao101@gmail.com> wrote:

> Hi Olivier,
>
> This seems to be a GKE-specific issue? Have you tried other vendors? Also,
> on the kubelet nodes, did you notice any pressure on the DNS side?
>
> Li
>
>
> On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <
> o.girardot@lateral-thoughts.com> wrote:
>
>> Hi everyone,
>> I have ~300 Spark jobs on Kubernetes (GKE) using the cluster auto-scaler,
>> and sometimes while running these jobs a pretty bad thing happens. The
>> driver (in cluster mode) gets scheduled on Kubernetes and launches many
>> executor pods.
>> So far so good, but the k8s "Service" associated with the driver does not
>> seem to be propagated in terms of DNS resolution, so all the executors fail
>> with a "spark-application-......cluster.svc.local" does not exist error.
>>
>> With all executors failing, the driver should fail too, but it treats
>> this as a "pending" initial allocation and stays stuck forever in a loop
>> of "Initial job has not accepted any resources, please check Cluster UI".
>>
>> Has anyone else observed this kind of behaviour?
>> We had it on 2.3.1 and I upgraded to 2.4.1, but the issue still seems to
>> exist even after the "big refactoring" of the Kubernetes cluster scheduler
>> backend.
>>
>> I can work on a fix / workaround, but I'd like to check with you on the
>> proper way forward:
>>
>>    - Some processes (like the Airflow Helm recipe) rely on a "sleep 30s"
>>    before launching the dependent pods (that could be added to
>>    /opt/entrypoint.sh used in the Kubernetes packaging)
>>    - We can add a simple step to the init container that tries the DNS
>>    resolution and fails after 60s if it did not work
>>
>> But these steps won't change the fact that the driver will stay stuck,
>> thinking we're still in the initial allocation delay.
>>
>> Thoughts?
>>
>> --
>> *Olivier Girardot*
>> o.girardot@lateral-thoughts.com
>>
>
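
For reference, the second workaround listed in the quoted message (retry the
DNS lookup of the driver service and give up after 60s) could look roughly
like the Python sketch below. This is only an illustration under assumptions:
the function name wait_for_dns and its parameters are hypothetical and are not
part of Spark or of its Kubernetes entrypoint, and the 2s retry interval is an
arbitrary choice. Only standard-library calls are used.

```python
import socket
import time


def wait_for_dns(hostname, timeout_s=60.0, interval_s=2.0,
                 resolve=socket.gethostbyname,
                 clock=time.monotonic, sleep=time.sleep):
    """Retry a DNS lookup until it succeeds or timeout_s has elapsed.

    Hypothetical helper, not a Spark API. Returns the resolved IPv4
    address as a string, or None if the name never resolved in time.
    The resolve, clock and sleep parameters exist so that the loop can
    be exercised without a real network or real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            return resolve(hostname)
        except OSError:
            # socket.gaierror (name not resolvable yet) is a subclass of OSError.
            if clock() >= deadline:
                return None
            sleep(interval_s)
```

An executor-side startup step could call wait_for_dns with the driver service
hostname and exit with a non-zero status when it returns None, so that the pod
fails fast instead of starting against an unresolvable driver. As noted in the
quoted message, this would not by itself stop the driver from waiting on the
initial allocation.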
