By now, most serious DBAs and SAs should have a basic grasp of queuing theory concepts such as offered load, even if not actually using the maths every day. A busy system may have thousands of processes, each waiting for its turn on a few dozen cores (CPUs), which are the queues in our system. The behavior of such a system looks like this, with the arrival rate on the x axis plotted against the total time to service a request on the y axis:

[Graph: total time to service a request vs. arrival rate]
From this we can see that a system behaves normally up until a certain rate of arrival of new work (i.e. processes requiring time on the CPU) and then deteriorates rapidly. That typically happens when a system is at around 80% of its theoretical maximum capacity, and the effect is more pronounced the more cores you have in a system. The Hammer has done a lot to bring this knowledge into the Oracle community, much of which is obvious but needs to be stated explicitly to be reasoned about, such as: the total CPU time to service a request is the sum of the time actually on the CPU plus the time spent waiting to get onto the CPU in the first place. It makes sense to assume that the true CPU time for a given computation is fixed, all other things being equal, i.e. that it will take c clock cycles to execute n instructions.

We can easily see the time a process spends in kernel, userland and idle. Now kernel time is usually considered to be time spent in system calls, but it can be something else too, which I don’t believe is well instrumented†: the time taken from the instant our process reaches the CPU to the moment the first userland code in the process is executed, or the program counter is incremented, however you prefer to think of it. I’ll call this transition time, so total CPU is time waiting in the queue, plus time on the CPU, plus time taken to transition between the two states‡. Then there is the time between stretches of useful work being done, in kernel or userland; I suppose that could be called faff time (e.g. context switching is transition, stalling on a cache miss is faffing).
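By way of illustrating the gap (my example, not from the original post): on Linux you can at least count how often a process was switched off the CPU via getrusage(), but nothing in the standard per-process accounting tells you what each of those transitions cost in time. A minimal sketch:

```c
/* Sketch: count how often this process was switched off the CPU via
 * getrusage(). The kernel exposes the number of switches, but not the
 * time each switch (the "transition"/"faff" above) actually cost. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

int main(void)
{
    /* Burn a little CPU so there is something to be scheduled. */
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)
        x += (double)i * 1e-9;

    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return 1;
    }

    printf("user time     : %ld.%06ld s\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
    printf("system time   : %ld.%06ld s\n",
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    printf("voluntary switches   : %ld\n", ru.ru_nvcsw);
    printf("involuntary switches : %ld\n", ru.ru_nivcsw);
    return 0;
}
```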
Recently we had a performance issue where we saw that the CPU utilization on one of our systems was increasing far more rapidly than the increase in workload in transactions per second would seem to justify. Suspecting that we were over on the right-hand side of the graph, a colleague of mine set about reproducing the problem on a test system. He needed to artificially create some load on the system to get it near the tipping point so he could examine a specific transaction’s behavior without too much noise, so he wrote a simple load generator that spun in an infinite loop, and started one per core. top reported the expected load average, but… there was no degradation in the performance of the database! So what was happening? Or more precisely, what was happening in production OLTP that wasn’t happening in the test environment?
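Something along these lines, I imagine (a sketch under my own assumptions, not his actual code): fork one empty busy-loop per core, which drives the load average up to the core count while barely touching memory.

```c
/* Sketch of a naive load generator: fork one process per core, each
 * spinning in an empty infinite loop. Drives load average up to the
 * core count while touching almost no memory. */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (cores < 1)
        cores = 1;

    for (long i = 0; i < cores; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: spin forever. volatile stops the compiler
             * optimising the loop away. */
            volatile unsigned long n = 0;
            for (;;)
                n++;
        }
    }

    printf("started %ld spinners; kill them when done\n", cores);
    pause(); /* parent just waits */
    return 0;
}
```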
Well, after discussing it we quickly figured it out: the real system was having to do some work other than just that done directly on behalf of our application – managing its L1 and L2 caches, for a start! Generating the load by allocating and filling an array of a few million random numbers and iterating over them doing some trivial computation would be much more representative (sketched below). Even better would be to have some process-local data and some in shared memory connected to by all the load-generating processes, reading and writing both in the course of the computation to stress the bus/IPC subsystem too, but this was enough for some useful insight. The transition/faff time means that fewer useful cycles are available per quantum, implying more transitions will be needed to execute our n instructions. Every transition comes with its own baggage of faff as the caches are dumped and reloaded. Access to semaphore-protected shared memory is also a queue, and thus behaves in the same way as any other queue, as per the graph (different numbers, same overall shape). And all the time new work is arriving, so the queues are growing.
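A sketch of that more representative generator (sizes and the arithmetic are illustrative, not taken from the original test; the point is a working set much larger than L1/L2, so every pass churns the caches):

```c
/* Sketch of a more representative load generator: a working set far
 * larger than L1/L2, so every pass evicts and reloads cache lines
 * instead of spinning entirely in-cache. */
#include <stdio.h>
#include <stdlib.h>

#define N (8 * 1024 * 1024)   /* 8M doubles ~ 64 MB, well beyond L2 */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    if (a == NULL) {
        perror("malloc");
        return 1;
    }

    /* Fill with pseudo-random values. */
    srand(42);
    for (size_t i = 0; i < N; i++)
        a[i] = (double)rand() / RAND_MAX;

    /* Loop forever doing trivial work over the whole array, reading and
     * writing so lines are dirtied as well as fetched. (A fuller version
     * would also read/write a shared-memory segment attached by every
     * generator process, to stress the bus/IPC path as suggested above.) */
    double acc = 0.0;
    unsigned long pass = 0;
    for (;;) {
        for (size_t i = 0; i < N; i++) {
            acc += a[i];                    /* read */
            a[i] = a[i] * 0.999 + 0.001;    /* write back, stays bounded */
        }
        if (++pass % 100 == 0)
            fprintf(stderr, "pass %lu, acc %g\n", pass, acc);
    }
}
```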
Eventually these effects come to dominate and then rapidly overwhelm the system, yet it is difficult to account for them as anything other than “CPU time”, the sum of in-queue, transition, on-CPU and faff, with no further breakdown. As an Oracle DBA I can very easily account for logical vs physical I/O, or, as an SA, for virtual memory page faults, but not the equivalent on the CPU. Spinning in a loop was cheap to context-switch, could live easily in the cache, and only needed to queue for a CPU, so while the numbers in top looked correct, the system wasn’t really that busy. Load average is less and less useful the more you know about it… In fact, I would argue that it is not really useful at all for answering the basic question of “how busy is my system?”, and I strongly question why so much monitoring and alerting is based on it!
† If anyone knows differently, please shout! Common tools don’t have, e.g., columns for %stall and %switch.
‡ And to tear down at the end of the time slice? I don’t actually know. I think we can safely overlook time spent sorting the run queue as insignificant with modern schedulers.