We had already written an awk script to pull durations for a particularly slow web service call, and we wanted to understand the distribution of the response times. An average is often high only because it is skewed by a small number of extremely large samples. To check for this, we look at the standard deviation of the data set. The second awk call below is something we scripted very quickly to calculate these statistics.
-bash-4.2$ awk -f parse_eom_calls.awk log/eom-main.log | \
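awk '{ s += $NF; t[++i] = $NF }
END {
    mean = s / NR
    for (i in t) {
        t1[i] = (t[i] - mean)^2      # squared deviation from the mean
        if (t[i] < mean) {
            kurtosis += 1            # counts samples below the mean
        }
    }
    for (i in t) {
        d += t1[i]                   # sum of squared deviations
    }
    # The median lookup is only correct if the input is already sorted.
    print "Average:", mean,
          "Median:", t[int(NR/2)],
          "Standard Deviation:", sqrt(d / (length(t1) - 1)),
          "Coefficient of variation:", sqrt(d / (length(t1) - 1)) / mean,
          "Kurtosis:", (kurtosis / NR) * 100
}'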
Average: 6086.13 Median: 2909 Standard Deviation: 30952.2 Coefficient of variation: 5.08569
-bash-4.2$
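As you can see above, the distribution was very widely spread: the standard deviation is about five times the mean.
We also added a simple calculation that the script labels kurtosis. Strictly speaking, kurtosis measures how heavy the tails of a distribution are; what the script computes is the percentage of samples that fall below the mean. In our case, almost 90% of the samples are below the mean, which is again a good indication of skewness.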
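For comparison, here is a rough sketch of how the same numbers could be computed with a sorted median and moment-based skewness and kurtosis. It is not the script we ran above. The function name duration_stats is made up for illustration, and it assumes one numeric duration per line on stdin. Skewness and kurtosis use the population moment definitions, so a normal distribution would give a kurtosis of about 3.

```shell
# Hypothetical helper (not part of the original session).
# Assumes one numeric duration per line on stdin.
duration_stats() {
    sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
        if (NR < 2) { print "need at least two samples"; exit 1 }
        mean = sum / NR
        # Input is sorted, so the median is the middle value,
        # or the mean of the two middle values for an even count.
        if (NR % 2) median = v[(NR + 1) / 2]
        else        median = (v[NR / 2] + v[NR / 2 + 1]) / 2
        for (i = 1; i <= NR; i++) {
            dev = v[i] - mean
            m2 += dev^2; m3 += dev^3; m4 += dev^4
            if (v[i] < mean) below++
        }
        if (m2 == 0) { print "all samples are identical"; exit 1 }
        sd  = sqrt(m2 / (NR - 1))     # sample standard deviation
        pm2 = m2 / NR                 # population variance
        printf "mean=%.2f median=%.2f sd=%.2f cv=%.2f ", mean, median, sd, sd / mean
        printf "below_mean_pct=%.1f ", 100 * below / NR
        printf "skewness=%.2f kurtosis=%.2f\n", (m3 / NR) / (pm2 ^ 1.5), (m4 / NR) / (pm2 ^ 2)
    }'
}
```

Assuming the duration is the last field of each line, as in the session above, it could be fed with something like: awk -f parse_eom_calls.awk log/eom-main.log | awk '{print $NF}' | duration_stats. A large positive skewness together with a high share of samples below the mean points to the same long right tail described above.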