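Kubernetes controllers use informers to talk to the APIServer. Informers are built on the well-known LIST-WATCH mechanism: a client first sends a LIST request to fetch a full snapshot of the objects it cares about, then opens a WATCH to receive only the changes from that point on.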
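
To make the two phases concrete, here is a minimal in-memory sketch. `FakeAPIServer` and `ToyInformer` are hypothetical names made up for illustration; they are not client-go or any real Kubernetes client, and a real WATCH is a long-lived streaming connection rather than a polled list of events.

```python
class FakeAPIServer:
    """In-memory stand-in for an APIServer (hypothetical, for illustration only)."""

    def __init__(self):
        self.objects = {}          # name -> value
        self.events = []           # (resource_version, kind, name, value)
        self.resource_version = 0

    def apply(self, name, value):
        # Every write bumps the resource version and records an event.
        self.resource_version += 1
        kind = "MODIFIED" if name in self.objects else "ADDED"
        self.objects[name] = value
        self.events.append((self.resource_version, kind, name, value))

    def list(self):
        # Expensive: returns a full snapshot of every object.
        return dict(self.objects), self.resource_version

    def watch(self, since):
        # Cheap: returns only the events newer than `since`.
        return [e for e in self.events if e[0] > since]


class ToyInformer:
    """Keeps a local cache in sync: one LIST, then incremental WATCH events."""

    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.resource_version = 0

    def sync(self):
        # LIST phase: replace the cache with a full snapshot.
        self.cache, self.resource_version = self.server.list()

    def poll(self):
        # WATCH phase: apply only what changed since the last known version.
        for rv, _kind, name, value in self.server.watch(self.resource_version):
            self.cache[name] = value
            self.resource_version = rv
```

The point of the split is that `sync()` costs as much as the whole dataset, while `poll()` costs only as much as what changed.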

The problem is that when an APIServer crashes or restarts, every controller connected to it is forced to reconnect and send its LIST request again.

LIST requests are very expensive, while WATCH requests are quite cheap. When all controllers send LIST requests to an APIServer within a short time window, they can overwhelm it again before it finishes starting up. The cycle then repeats: the APIServer gets stuck in CrashLoopBackOff ad infinitum.

However, if all clients get past the LIST phase and reach the WATCH phase, the load on the APIServer drops drastically.
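
The crash loop and the drop in load can be illustrated with a toy model. This is only a sketch under made-up assumptions: `simulate_restart`, the cost units, and the capacity are all hypothetical, and real APIServer load is far messier than a single number per round.

```python
def simulate_restart(num_clients, list_cost, watch_cost, capacity, max_rounds=10):
    """Return the load seen by the server in each round after a restart.

    Toy model: every client that has not finished its LIST retries it in
    each round. If the total load exceeds capacity, the server crashes and
    no LIST completes, so all of those clients retry in the next round.
    """
    pending = num_clients  # clients that still need a full LIST
    history = []
    for _ in range(max_rounds):
        watching = num_clients - pending
        load = pending * list_cost + watching * watch_cost
        history.append(load)
        if pending == 0:
            break  # steady state: everyone is watching
        if load <= capacity:
            pending = 0  # all LISTs succeed, clients switch to WATCH
        # Otherwise the server crashes and every pending client LISTs again.
    return history


# Few clients: one expensive round, then cheap WATCH traffic.
print(simulate_restart(100, list_cost=50, watch_cost=1, capacity=10_000))
# prints [5000, 100]

# Many clients: the LIST storm exceeds capacity in every round.
print(simulate_restart(1000, list_cost=50, watch_cost=1, capacity=10_000, max_rounds=5))
# prints [50000, 50000, 50000, 50000, 50000]
```

In this model the same server handles 1000 watching clients easily (a load of 1000), yet it can never get them there, because the only path to WATCH goes through a LIST storm of 50000.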

So the conclusion is that the peculiar LIST-WATCH design of Kubernetes makes the load the APIServer faces very unpredictable.

This Netflix tech blog post is very helpful for solving issues like this one: https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581
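
As I understand it, the post is about adaptive concurrency limits: the server (or client) caps the number of in-flight requests and adjusts that cap based on observed behavior, much like TCP congestion control. The sketch below uses plain AIMD (additive increase, multiplicative decrease), which is a simpler stand-in and not necessarily the algorithm the post describes. `AIMDLimiter` and its parameters are hypothetical names for illustration.

```python
class AIMDLimiter:
    """Toy adaptive concurrency limiter using additive increase, multiplicative decrease."""

    def __init__(self, initial_limit=10, min_limit=1, max_limit=1000, backoff=0.5):
        self.limit = initial_limit    # current cap on in-flight requests
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.backoff = backoff        # multiplier applied to the limit on overload
        self.inflight = 0

    def try_acquire(self):
        # Reject immediately instead of queueing when the cap is reached.
        if self.inflight >= self.limit:
            return False
        self.inflight += 1
        return True

    def release(self, overloaded):
        # Call once per finished request; `overloaded` could mean a timeout
        # or an error that signals the server is struggling.
        self.inflight -= 1
        if overloaded:
            self.limit = max(self.min_limit, int(self.limit * self.backoff))
        else:
            self.limit = min(self.max_limit, self.limit + 1)


limiter = AIMDLimiter(initial_limit=4)
admitted = [limiter.try_acquire() for _ in range(5)]
print(admitted)       # prints [True, True, True, True, False]
limiter.release(overloaded=True)
print(limiter.limit)  # prints 2
```

Applied to the restart scenario, a limiter like this would let only a few LIST requests through at a time and reject the rest quickly, so the server could finish some of them and move those clients to WATCH instead of collapsing under all of them at once.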

I wonder why Kubernetes doesn’t have something like this built in.

(to be continued)