“resmgr:cpu quantum” preventing logins??

Below is the content of an email I wrote this morning. We could not login to a single instance database due to the statistics gathering job running long. What is odd is that when we tried to troubleshoot, we couldn’t even login using sqlplus -prelim / as sysdba…not good.

—————————————————————————————————————————

At around 11:20AM on Friday, November 12th, an alert was generated via email every five minutes until noon requesting that the WCAS service be “verified”, as a test query against it did not return in under 60 seconds. This query normally takes less than two seconds. At noon that same day, the DBA team logged in to troubleshoot. We found that we could not login to the instance, as our login would “hang”. We found LDAP error messages in the OS system log file. Thinking (erroneously) these messages may be related, we elected to bounce the instance and troubleshoot after the fact. After the instance was restarted, we found we could not find any useful information, but the issue had “gone away” as well.

On Monday morning, November 15th at 9:43AM, the same alert that was generated on Friday began to be propagated to operations, the DBA’s, and the application support teams. Once again, the DBA’s logged in and found they could not login to the instance. We were able to generate “system state dumps”, which can contain useful information to troubleshoot an issue after the fact. However, when we found no “smoking gun”, we elected to once again restart the instance.

This time however, the problem appeared again shortly after it was restarted. We found we could login, but only intermittently. When we did get logged in, we elected to run a “hanganalyze”. This command prints in human readable format a list of what sessions are blocking others, as well as the operational source of the block (what are they doing). When we did this, we found almost all sessions were blocked waiting for “cpu quantum”. This code tree is taken when a “resource manager” plan is enabled in the instance, and is acting as a CPU traffic cop, of sorts. The particular plan was the one that is enabled when query optimizer statistics are being gathered. Our understanding of this is that Oracle enables this plan to *protect* online users from CPU resource starvation. In other words, the statistics gathering job steps aside when a “real” user comes in.

Obviously, either:

Our understanding is incorrect
It doesn’t work as advertised

I’m thinking number 2 🙂

For now, we have disabled the statistics gathering job as we dig into exactly what it does to ensure:

We understand it
It either:

Doesn’t do it again, or

We have a workaround

For now the problem has been resolved pending further investigation.

—————————————————————————————————————————

Post navigation

Leave a Reply