I've discovered an issue with the event logger, specifically when reading an incomplete event log file that is compressed with 'zstd': the reader thread gets stuck reading that file.
This is very easy to reproduce: set the configuration as below
and start a Spark application. While the application is running, load the application in the SHS web page. It may succeed in replaying the event log, but it is highly likely to get stuck, and the page load will hang along with it.
Please refer to SPARK-29322 for more details.
As the issue only occurs with 'zstd', the simplest approach is to drop support for 'zstd' in the event log. A more general approach would be to introduce a timeout on reading the event log file, but the timeout would have to distinguish a thread that is stuck from a thread that is merely busy reading a huge event log file.
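
To make the second option more concrete, here is a minimal sketch of one way to tell "stuck" from "busy": instead of limiting the total read time, track when a read call last returned and treat the reader as stalled only if nothing has returned for longer than the timeout. This is not code from Spark. The class name ProgressTrackingInputStream and its methods are hypothetical, and the sketch assumes the wrapper is placed around the decompressed stream, so that a read blocked inside the codec shows up as no progress.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical wrapper (not part of Spark): remembers when a read call
 * last returned, so a watchdog thread can detect a stalled reader.
 */
class ProgressTrackingInputStream extends FilterInputStream {
    // Time (System.nanoTime) at which the most recent read call returned.
    private final AtomicLong lastProgressNanos = new AtomicLong(System.nanoTime());
    // Total number of bytes delivered to the caller so far.
    private final AtomicLong bytesRead = new AtomicLong();

    ProgressTrackingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        record(b >= 0 ? 1 : 0);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        record(Math.max(n, 0)); // n is -1 at end of stream
        return n;
    }

    private void record(int n) {
        bytesRead.addAndGet(n);
        lastProgressNanos.set(System.nanoTime());
    }

    long bytesRead() {
        return bytesRead.get();
    }

    /**
     * True if no read call has returned within timeoutNanos of nowNanos.
     * A busy reader keeps returning from read calls and is never reported
     * as stalled, however large the file is.
     */
    boolean isStalled(long timeoutNanos, long nowNanos) {
        return nowNanos - lastProgressNanos.get() > timeoutNanos;
    }
}
```

A watchdog could call isStalled(timeout, System.nanoTime()) periodically and interrupt or abandon the replay when it returns true. Whether a blocked read can actually be interrupted depends on the codec, so this only covers the detection side.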
Which approach would the Spark community prefer, or can someone propose a better idea for handling this?
Jungtaek Lim (HeartSaVioR)