spark-user mailing list archives

From Warren Zhu <warren.zh...@gmail.com>
Subject How to troubleshoot MetadataFetchFailedException: Missing an output location for shuffle 0
Date Mon, 16 Dec 2019 20:00:45 GMT
Hi All,

I have seen this exception many times in my production environment for
long-running batch jobs. Is there a summary of all root causes of this
exception? Below is my analysis:

1. This happens when an executor tries to fetch the MapStatus of some shuffle.
2. Each executor maintains a local cache of all map statuses. When a status
cannot be found in the local cache, the executor tries to fetch the latest
from the driver, which acts as the MapOutputTrackerMaster.
3. The driver's map statuses are only cleared when the epoch is updated.
4. The epoch is updated when an executor is restarted. This might be caused
by a lost executor. I have confirmed that if a container (executor) is
killed by YARN for exceeding memory limits, then this exception will
happen.
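
To make the four points above concrete, here is a toy model of that flow in
Python. It is only a sketch of my understanding, not Spark's actual code: the
class names (ToyDriverTracker, ToyExecutorTracker, MissingOutputLocation) and
executor ids are made up for illustration, and the real MapOutputTracker is
more involved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class MissingOutputLocation(Exception):
    """Toy stand-in for MetadataFetchFailedException (hypothetical name)."""


@dataclass
class ToyDriverTracker:
    """Toy stand-in for the driver-side tracker (assumed behavior)."""
    # shuffle id -> executor id holding each map output (None = output lost)
    statuses: Dict[int, List[Optional[str]]] = field(default_factory=dict)
    epoch: int = 0

    def register(self, shuffle_id: int, locations: List[str]) -> None:
        self.statuses[shuffle_id] = list(locations)

    def executor_lost(self, executor_id: str) -> None:
        # Drop every map output that lived on the lost executor, then bump
        # the epoch so that executors know their caches are stale.
        for locs in self.statuses.values():
            for i, loc in enumerate(locs):
                if loc == executor_id:
                    locs[i] = None
        self.epoch += 1


@dataclass
class ToyExecutorTracker:
    """Toy stand-in for the executor-side cache (assumed behavior)."""
    driver: ToyDriverTracker
    cache: Dict[int, List[Optional[str]]] = field(default_factory=dict)
    epoch: int = 0

    def update_epoch(self, new_epoch: int) -> None:
        # A newer epoch invalidates everything cached locally.
        if new_epoch > self.epoch:
            self.epoch = new_epoch
            self.cache.clear()

    def get_locations(self, shuffle_id: int) -> List[str]:
        # Cache miss: ask the driver for its current view.
        if shuffle_id not in self.cache:
            self.cache[shuffle_id] = list(self.driver.statuses[shuffle_id])
        locs = self.cache[shuffle_id]
        if any(loc is None for loc in locs):
            raise MissingOutputLocation(
                f"Missing an output location for shuffle {shuffle_id}"
            )
        return [loc for loc in locs if loc is not None]
```

In this model, registering shuffle 0 on two executors, calling
executor_lost() for one of them, and then calling update_epoch() followed by
get_locations(0) on the executor side raises the toy exception, which mirrors
what I see after YARN kills a container.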

So I have 3 questions:
1. Is my analysis correct?
2. Are there other clues or causes that could result in this exception?
3. How can I fix this exception?

Thanks,
Warren
