awk statistical functions

We had already written an awk script to pull durations for a particularly slow web service call, and we wanted to understand the distribution of the response times. An average is often high only because it is skewed by a small number of extremely large samples. To check for this, we look at the standard deviation of the data set. The second awk call below is what we quickly scripted to calculate these statistics.

-bash-4.2$ awk -f parse_eom_calls.awk log/eom-main.log | \
           awk '{s+=$NF;t[++i]=$NF} END \
                {\
                  for (i in t) { \
                     t1[i]=(t[i] - s/NR)^2;\
                     if (t[i] < s/NR) {
                       kurtosis += 1
                     }
                  };\
                  for (i in t) {\
                    d+=t1[i]\
                  };\
                  print "Average:",s/NR,
                        "Median:",t[int(NR/2)],\
                        "Standard Deviation:",sqrt(d/(length(t1)-1)),\
                        "Coefficient of variation:",(sqrt(d/(length(t1)-1)))/(s/NR),\
                        "Kurtosis:",(kurtosis/NR)*100\
                }'
Average: 6086.13 Median: 2909 Standard Deviation: 30952.2 Coefficient of variation: 5.08569
-bash-4.2$

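As you can see above, the distribution is very spread out: the standard deviation is about five times the mean. Note that t[int(NR/2)] is simply the middle sample in input order, so the Median figure is a true median only if the durations arrive already sorted.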
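The same summary can be sketched as a small shell function. This is a minimal sketch and not the script we ran. The name dist_stats is hypothetical. It assumes one numeric duration in the last field of each line, and it sorts the values first so that the median does not depend on input order.

```shell
# Hypothetical helper: summarize the numeric last field of each input line.
dist_stats() {
  # Extract the last field, sort numerically, then compute the statistics.
  awk '{ print $NF }' | sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR < 2) { print "need at least two samples"; exit 1 }
      mean = sum / NR
      for (i = 1; i <= NR; i++) {
        ss += (v[i] - mean) ^ 2        # sum of squared deviations
        if (v[i] < mean) below++       # count of samples below the mean
      }
      # Input is sorted, so the middle element(s) give the median.
      median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      sd = sqrt(ss / (NR - 1))         # sample standard deviation
      cv = (mean != 0) ? sd / mean : 0 # guard against a zero mean
      printf "mean=%.2f median=%.2f sd=%.2f cv=%.2f below_mean_pct=%.1f\n", \
             mean, median, sd, cv, 100 * below / NR
    }'
}
```

It would slot into the pipeline in place of the second awk call, for example: awk -f parse_eom_calls.awk log/eom-main.log | dist_stats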

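We also added a simple calculation labelled kurtosis. Strictly speaking, what the script computes is the percentage of samples below the mean. That is not kurtosis in the statistical sense, which is the fourth standardized moment and describes how heavy the tails of a distribution are. In our case, almost 90% of the samples are below the mean, which is again a good indication of skewness.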
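For comparison, moment-based skewness and excess kurtosis can be computed in awk as well. The following is a sketch under stated assumptions and not part of our original script. The function name moments is hypothetical, the input is assumed to carry a numeric value in the last field of each line, and the population (divide by N) form of the moments is used.

```shell
# Hypothetical helper: moment-based skewness and excess kurtosis
# of the numeric last field of each input line.
moments() {
  awk '
    { v[NR] = $NF; sum += $NF }
    END {
      if (NR < 2) { print "need at least two samples"; exit 1 }
      mean = sum / NR
      for (i = 1; i <= NR; i++) {
        d = v[i] - mean
        m2 += d ^ 2; m3 += d ^ 3; m4 += d ^ 4
      }
      # Central moments, population form (divide by N).
      m2 /= NR; m3 /= NR; m4 /= NR
      if (m2 == 0) { print "all samples identical"; exit 1 }
      # Skewness = m3 / m2^(3/2); excess kurtosis = m4 / m2^2 - 3.
      printf "skewness=%.2f excess_kurtosis=%.2f\n", m3 / m2 ^ 1.5, m4 / (m2 * m2) - 3
    }'
}
```

A positive skewness means a long right tail, which is the pattern a few very slow calls would produce. An excess kurtosis above zero means heavier tails than a normal distribution.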
