I thought this was interesting. By taking the total response time of pages from the access log, you can compare the amount of server CPU used against the total elapsed request time. Note that this only really works for web servers whose CPU is spent serving requests; if a lot of business logic runs on the server outside of what is accounted for in page response times, the numbers won’t be accurate. It’s not perfect, but it’s definitely good enough for conversation.
So how do we do this?
First, we take the CPU used in the period in which we are interested…
-bash-4.1$ sar -s 11:30:00 -e 11:40:00
Linux 2.6.32-220.el6.x86_64 (foobar1.com)   09/27/2014   _x86_64_   (1 CPU)

11:30:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:31:01 AM     all     72.68      0.00     11.04      0.00      0.00     16.28
11:32:01 AM     all     39.67      0.00      3.35      0.03      0.00     56.95
11:33:01 AM     all     31.82      0.00      1.08      0.02      0.00     67.08
11:34:01 AM     all     31.81      0.00      2.01      0.08      0.00     66.09
11:35:01 AM     all     33.78      0.00      1.63      0.03      0.00     64.56
11:36:01 AM     all     26.67      0.00      1.22      0.03      0.00     72.08
11:37:01 AM     all     28.18      0.00      1.17      0.08      0.00     70.57
11:38:01 AM     all     28.43      0.00      1.17      0.02      0.00     70.39
11:39:01 AM     all     24.21      0.00      0.95      0.07      0.00     74.77
Average:        all     35.19      0.00      2.61      0.04      0.00     62.16
-bash-4.1$
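As a back-of-the-envelope check, the sar averages can be converted into absolute CPU-seconds. This is a minimal sketch, not part of the original procedure: it assumes the nine one-minute samples shown above on a single CPU, and takes busy time as %user plus %system from the "Average:" row.

```shell
# Convert the sar averages into CPU-seconds consumed over the window.
# Assumptions: 9 one-minute intervals (11:30:01-11:39:01), 1 CPU,
# busy = %user + %system from the Average: line above.
awk 'BEGIN {
  cpus = 1; minutes = 9
  busy_pct = 35.19 + 2.61          # average %user + %system
  printf "%.0f CPU-seconds consumed\n", busy_pct / 100 * cpus * minutes * 60
}'
# prints: 204 CPU-seconds consumed
```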
We then take the same time period and total up the page response time per minute from the access log. As shown above, we have only a single CPU, so the elapsed-time percentages printed below are scaled for one CPU (the 1 in the awk expression)…
-bash-4.1$ awk '$3 ~ "2014:11:3" {split($3,t,":");s[t[2]":"t[3]]+=$NF} END {for (i in s) {print i,s[i]/1000/(1*60)*100}}' /var/log/jbossas/p_ecm_01/localhost_access_log.2014-09-27.log | sort
11:30 293.29
11:31 174.033
11:32 57.7817
11:33 60.3083
11:34 61.2817
11:35 56.05
11:36 55.3567
11:37 51.3367
11:38 48.24
11:39 34.8517
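To make the one-liner easier to follow, here is the same aggregation rerun against a fabricated three-line log. The log layout (timestamp in field 3, response time in milliseconds as the last field) is an assumption inferred from the awk expression above, not a confirmed JBoss access-log configuration; the sums per minute are divided by 1000 (ms to seconds) and by 60 (seconds per minute), then multiplied by 100 to get a percentage of wall-clock time.

```shell
# Fabricated sample log: field 3 is the timestamp, the last field is the
# response time in milliseconds (an assumed format, mirroring the real log).
cat > /tmp/sample_access.log <<'EOF'
10.0.0.1 - [27/Sep/2014:11:30:05 +0000] "GET /a HTTP/1.1" 200 512 1500
10.0.0.2 - [27/Sep/2014:11:30:40 +0000] "GET /b HTTP/1.1" 200 256 4500
10.0.0.3 - [27/Sep/2014:11:31:10 +0000] "GET /a HTTP/1.1" 200 512 3000
EOF

# Sum response time per minute, then express it as a percentage of one
# minute of wall-clock time on 1 CPU: ms -> /1000 -> seconds -> /60 -> *100.
awk '$3 ~ "2014:11:3" {split($3,t,":"); s[t[2]":"t[3]] += $NF}
     END {for (i in s) print i, s[i]/1000/(1*60)*100}' /tmp/sample_access.log | sort
# prints:
# 11:30 10
# 11:31 5
```

So 6,000 ms of response time inside the 11:30 minute is 10% of that minute’s wall clock.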
As you can see above, early in the test we used 72% of our available CPU time but 293% of wall-clock time (any system has an infinite capacity for waiting), so CPU accounted for less than a quarter of total response time. Near the end of the test, however, we use 24% of available CPU time against a corresponding 34% of wall-clock time, so CPU as a percentage of total response time rises to nearly 70%. This is a cached application: early in the test, wait times were higher while data was fetched from the database, so CPU time was a smaller fraction of total response time. Later in the test, once the data was being managed in memory, CPU took a larger share of the pie as fewer external waits occurred.
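The arithmetic in the paragraph above can be spelled out directly. This sketch pairs the %user figures from sar with the per-minute wall-clock percentages from the access log, for the first and last full minutes of the run (the pairing follows the text; %system is ignored here, as the prose quotes the %user numbers):

```shell
# CPU as a share of total response time = CPU% / wall-clock%.
# Figures copied from the sar and access-log output above.
awk 'BEGIN {
  printf "11:30  %.0f%% of response time was CPU\n", 72.68 / 293.29  * 100
  printf "11:39  %.0f%% of response time was CPU\n", 24.21 / 34.8517 * 100
}'
# prints:
# 11:30  25% of response time was CPU
# 11:39  69% of response time was CPU
```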