Unresponsive .NET Web Applications and the Mysterious ThreadPool

September 28, 2019

I’m not going to claim to be an expert on the matter, but I’m hoping some of my recent experiences can help others. The premise goes like this:

My C# web application is becoming unresponsive. Kubernetes marks instances as non-ready and takes them out of rotation because they fail to respond to the readiness probe within 10 seconds. Once that starts happening, it’s a vicious feedback loop: as more and more instances fail, the application can’t recover because the traffic concentrated on the remaining instances is too high. I use async/await everywhere I can, but it’s possible that a synchronous network call happens during some requests; after all, it’s a large app.

There are a couple of things you might initially do:

- Hunt down the remaining synchronous calls and convert them to async/await.
- Buy headroom: raise the readiness probe timeout, or run more replicas so a slow instance isn’t immediately overwhelmed.

You do the above, but you are still getting timeouts here and there. You don’t always see the timeouts on the liveness ping; they may show up elsewhere. What’s next? Learning more about how the .NET ThreadPool creates new threads. In particular, read about System.Threading.ThreadPool.SetMinThreads.
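
Before tuning anything, it’s worth knowing what you’re starting from. Here is a minimal sketch (the class and variable names are mine, not from any framework) that dumps the current ThreadPool limits at startup:

using System;
using System.Threading;

class Program
{
    static void Main()
    {
        ThreadPool.GetMinThreads(out int minWorker, out int minIo);
        ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);
        Console.WriteLine($"Worker threads: min={minWorker}, max={maxWorker}");
        Console.WriteLine($"IO completion threads: min={minIo}, max={maxIo}");
    }
}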

Here is my understanding:

- The ThreadPool keeps a minimum number of worker threads ready; by default this is the number of logical CPU cores.
- Below that minimum, the pool creates a new thread immediately whenever work is queued and no thread is free.
- Past the minimum, the pool throttles thread creation: it injects roughly one new thread per 500 ms while work is still queued, to protect the system from runaway thread creation.
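
You can watch the throttling with your own eyes using a quick console experiment like this hypothetical one: queue more blocking work items than the minimum and log when each one starts.

using System;
using System.Diagnostics;
using System.Threading;

class InjectionDemo
{
    static void Main()
    {
        ThreadPool.SetMinThreads(4, 4); // keep the minimum small so the effect is visible
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 12; i++)
        {
            int id = i;
            ThreadPool.QueueUserWorkItem(_ =>
            {
                Console.WriteLine($"Item {id} started at {sw.ElapsedMilliseconds} ms");
                Thread.Sleep(10_000); // hold the thread so the remaining work forces injection
            });
        }
        // The first few items start immediately; the rest start roughly
        // 500 ms apart as the pool injects one thread at a time.
        Console.ReadLine();
    }
}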

For the purposes of this explanation, suppose we have a web app that behaves like this:

- It receives 10 requests per second.
- Each request makes a synchronous (blocking) call to a backend that takes 2 seconds to return.
- The ThreadPool minimum worker thread count is left at its default; say 4, for a 4-core machine.

Here is how this might look if 10 requests come in at the beginning of each second (an illustrative timeline, given the numbers above and roughly one injected thread per 500 ms):

- t = 0 s: 10 requests arrive; 4 start on the existing threads, 6 wait in the queue.
- t = 0.5 s: one thread is injected and the 5th request finally starts.
- t = 1 s: another thread is injected (6 total), but 10 more requests arrive; 14 are now waiting.
- t = 2 s: the first 4 requests complete and their threads pick up queued work, but arrivals keep outpacing thread injection, so the backlog and the wait times keep growing.

In this fake example, you can see how requests to our app soon start to take more than 10 seconds, even though our backend returns in 2 seconds, purely because of synchronous calls and thread-creation delay. On top of that, it won’t recover, because the number of concurrent requests the app needs to handle grows over time. The mechanism that is supposed to protect us from creating too many threads actually hurts our app here!
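
If you want to see this end to end, here is a rough simulation of the fake example above (10 “requests” per second, each blocking for 2 seconds); what it prints is how long each request waited just to get a thread:

using System;
using System.Diagnostics;
using System.Threading;

class StarvationSimulation
{
    static void Main()
    {
        ThreadPool.SetMinThreads(4, 4); // the default on our imaginary 4-core box
        var sw = Stopwatch.StartNew();
        for (int second = 0; second < 5; second++)
        {
            for (int i = 0; i < 10; i++) // 10 "requests" arrive at once
            {
                long queuedAt = sw.ElapsedMilliseconds;
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    Console.WriteLine($"Waited {sw.ElapsedMilliseconds - queuedAt} ms for a thread");
                    Thread.Sleep(2000); // the synchronous 2-second backend call
                });
            }
            Thread.Sleep(1000); // next batch arrives a second later
        }
        Console.ReadLine();
    }
}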

If, instead, our minimum worker thread count was set to 25 (request latency × requests per second, plus a buffer), all of the requests take 2 seconds, and the first ones finish before we ever hit the throttling algorithm. The moral of the story is that you might want to call ThreadPool.SetMinThreads at the beginning of your program with an appropriate value if synchronous code executes on some of your requests. It’s not just failure to respond to incoming requests that exposes this issue; by then, things have really gotten bad. You may first see timeouts happening mid-request because a worker thread took too long to get scheduled again.
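
A minimal sketch of that startup call might look like this (25 comes from the fake example above; derive your own value from your latency and traffic numbers):

using System.Threading;

class Program
{
    static void Main()
    {
        // 2 s latency × 10 requests/s + a small buffer = 25, from the example above.
        ThreadPool.GetMinThreads(out _, out int minIoThreads);
        ThreadPool.SetMinThreads(25, minIoThreads); // leave the IO completion minimum alone

        // ... build and run the web host as usual ...
    }
}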

It’s a good idea to have metrics covering these. You can use ThreadPool.GetAvailableThreads, ThreadPool.GetMaxThreads, ThreadPool.GetMinThreads, and calculate these:

workerThreadInUse = workerThreadMax - workerThreadAvailable;
workerThreadFree = workerThreadMin - workerThreadInUse;
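
Putting that together, a hypothetical helper to sample those counters might look like this (the names are mine; report the two values to whatever metrics system you already use):

using System.Threading;

static class ThreadPoolMetrics
{
    // Samples the worker thread counters described above.
    public static (int InUse, int Free) Sample()
    {
        ThreadPool.GetAvailableThreads(out int workerAvailable, out _);
        ThreadPool.GetMaxThreads(out int workerMax, out _);
        ThreadPool.GetMinThreads(out int workerMin, out _);

        int inUse = workerMax - workerAvailable;
        int free = workerMin - inUse; // goes negative once you're past the minimum
        return (inUse, free);
    }
}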

You might be running low on capacity when workerThreadFree is low (or negative) or CPU is high.