Alright, here is the situation we were thrown in recently. Our bread n’ butter API got released. It went well initially and then intermittently started producing extraordinarily high response times. Everybody freaked out. We had tested the API for large, consistent load and here it was not withstanding moderate, intermittent traffic.
Our initial reaction was the cache (we are using in-memory, old fashioned System.Runtime.Cache) is getting invalidated after every 30 odd minutes as an absolute expiry. But the instrumentation logs were *not* suggesting that. There was no fixed pattern. It was totally wired! After some more instrumentation and analysis we zeroed down on some heavy singleton objects. We use Ninject as our IoC container. We have couple of objects created on startup and thrown into Ninject singleton bucket to be used later during the lifetime. What we observed that these objects were getting destroyed and created after certain amount of time! And creation of the objects was obviously taking hit. We were freaked out even more. Isn’t the definition of singleton – create object only once? Or is it Ninject – that is destroying the objects? What we found would surprise some of you. Ready for the magic?
While just browsing through the event logs we found following warning message in the Systems log (event source – WAS):
The default idle timeout value is twenty minutes, which means your app pool is shut down after twenty minutes if it’s not being used. Some people want to change this, because it means their apps are a bit slow after twenty minutes of inactivity.
That started the ball rolling. Ah, so it was the good old buddy IIS and the app pool. Even if its Azure and you want to call it as hosted web core, end of the day it is IIS. And IIS hosted website/web services run under app pools. And app pools, the good babies they are tend to recycle. :) They recycle after every 20 minutes (default value) if kept idle.
We had 4 instances and with the moderate traffic, Azure load balancer was not routing traffic to all instances consistently. So couple instances were getting idle and and app pools on them were recycling. After some time – when Azure load balancer used to divert traffic to these idle instances, those singleton objects which were destroyed would get created again and the incoming service call would pay the price to create these objects which is unfair!
To resolve this – we decided (and this solution worked for us and I do not claim that it applies to your scenarios as well. So if you want to try it at your home, make sure to wear helmet :) ) to do couple of things:
1. We had Akamai (or traffic manager) ping the service after every 5/10 seconds. This way the app pool does not go idle and hopefully never recycles.
2. We had a startup task in the web role to go set the IIS app pool idle timeout as 0. This means app pool lives forever. Obviously every now and then Azure fabric controller recycles the machines itself , but that’s ok.. I do not think we can prevent that.
Start up task
Added following cmd as the start up task in the web role.
%windir%\system32\inetsrv\appcmd set config -section:applicationPools -applicationPoolDefaults.processModel.idleTimeout:00:00:00