Instance not restarting after failure in 11.2.0.3

I thought this was interesting. I had always assumed that the restart counter checked against a cluster resource's RESTART_ATTEMPTS parameter was only incremented when an attempt to restart the resource failed. However, while developing an internal training course for our team, I noticed that after a couple of induced failures, each followed by a successful restart by the clusterware, the clusterware would not restart the instance after a third induced failure. We found the following in the $GRID_HOME/log/$(hostname)/agent/crsd/oraagent_oracle/oraagent_oracle.log file:

2012-06-14 13:28:38.129: [ USRTHRD][3959420672] {0:9:6} ClusterSubscriber::SubscriberWorker::InternalClusterSubscriber::handleEventCBexecuting for reason 1
2012-06-14 13:28:38.129: [ USRTHRD][3959420672] {0:9:6} event type is CRS_NOT_RESTARTING
2012-06-14 13:28:38.129: [ USRTHRD][3959420672] {0:9:6} bodylen = 528
2012-06-14 13:28:38.129: [ USRTHRD][3959420672] {0:9:6} -----------BodyBlock----------
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  ACTION='1'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  CLS_TINT='{0:9:6}'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  CURRENT_STATE='OFFLINE'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  DATABASE_TYPE='RAC'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  DB_UNIQUE_NAME='express.home'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  INSTANCE_NAME='express1'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  NAME='ora.express.db'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  NUMBER_OF_ATTEMPTS='2'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  REASON='NOMORE_RESTART_ATTEMPTS'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  RESOURCE_CLASS='database'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  RESOURCE_INCARNATION_NUMBER='4'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  RESOURCE_LOCATION='expressdb1'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  SEQUENCE_NUMBER='300118'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  TARGET_STATE='ONLINE'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  TIMESTAMP='2012-06-14 13:28:38'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  TYPE='ora.database.type'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  USER='SYSTEM'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  Version='11.2.0.3.0'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  CLUSTER_NAME='expresscrs'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  DB_UNIQUE_NAME='express.home'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  ORACLE_CLUSTERWARE.SUBCOMPONENT='CRSD'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  RESOURCE_CLASS='database'

Notice the following line…

2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6}  REASON='NOMORE_RESTART_ATTEMPTS'
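
To pull just the interesting fields out of a long agent log, something like the helper below could be used. It is only a sketch: restart_refusals is a name I made up, and it assumes the event body lines look exactly like the excerpt above, with the KEY='value' pair as the last whitespace-separated field on the line.

```shell
# Print the NAME, NUMBER_OF_ATTEMPTS and REASON body fields found in an
# oraagent log. Hypothetical helper; assumes the line layout shown above.
# Usage: restart_refusals /path/to/oraagent_oracle.log
restart_refusals() {
    # The KEY='value' pair is the last field on each body line. Anchoring the
    # match at the start of that field keeps INSTANCE_NAME, CLUSTER_NAME and
    # DB_UNIQUE_NAME from matching the NAME alternative.
    awk '$NF ~ /^(NAME|NUMBER_OF_ATTEMPTS|REASON)=/ { print $NF }' "$1"
}
```

Run against the excerpt above, this would print the NAME, NUMBER_OF_ATTEMPTS and REASON pairs and nothing else.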

After we raised RESTART_ATTEMPTS on the resource to 5, the clusterware restarted the instance successfully again (up to five times).

expressdb1:grid:+ASM1:/home/grid# crsctl modify resource ora.express.db -attr RESTART_ATTEMPTS=5
expressdb1:grid:+ASM1:/home/grid# crs_stat -p ora.express.db
NAME=ora.express.db
TYPE=ora.database.type
ACTION_SCRIPT=
ACTIVE_PLACEMENT=1
AUTO_START=restore
CHECK_INTERVAL=1
DESCRIPTION=Oracle Database resource
FAILOVER_DELAY=0
FAILURE_INTERVAL=60
FAILURE_THRESHOLD=1
GEN_START_OPTIONS@SERVERNAME(expressdb1)=open
GEN_START_OPTIONS@SERVERNAME(expressdb2)=open
GEN_USR_ORA_INST_NAME@SERVERNAME(expressdb1)=express1
GEN_USR_ORA_INST_NAME@SERVERNAME(expressdb2)=express2
HOSTING_MEMBERS=
PLACEMENT=restricted
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=60
START_TIMEOUT=600
STOP_TIMEOUT=600
UPTIME_THRESHOLD=1h
USR_ORA_INST_NAME@SERVERNAME(expressdb1)=express1
USR_ORA_INST_NAME@SERVERNAME(expressdb2)=express2

expressdb1:grid:+ASM1:/home/grid#
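
When only one attribute matters, the full profile listing is more than is needed. A small filter along these lines could pick out a single value; resource_attr is a hypothetical helper name, and it assumes the NAME=value layout that crs_stat -p printed above.

```shell
# Print the value of one attribute from NAME=value profile text on stdin.
# Hypothetical helper; assumes one attribute per line, as in the listing above.
# Usage: crs_stat -p ora.express.db | resource_attr RESTART_ATTEMPTS
resource_attr() {
    # Split each line at the first "=" only and compare the key as a plain
    # string, so keys such as GEN_START_OPTIONS@SERVERNAME(expressdb1) work.
    awk -v key="$1" '
        { pos = index($0, "=") }
        pos > 0 && substr($0, 1, pos - 1) == key { print substr($0, pos + 1) }
    '
}
```

Fed the listing above, resource_attr RESTART_ATTEMPTS would print 5.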

So the RESTART_ATTEMPTS limit evidently applies to every restart performed by the clusterware, not only to restart attempts that fail. As soon as the resource is started manually, the restart counter is reset to 0.
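
The behaviour we observed can be written down as a toy model. This is not how the clusterware is implemented, just my reading of what we saw: every automatic restart counts against the limit, and once the count reaches the limit the next failure is not restarted. The function name is made up, and the limit of 2 in the usage line is an assumption inferred from NUMBER_OF_ATTEMPTS='2' in the log.

```shell
# Toy model of the observed behaviour (an assumption, not Oracle's code):
# print what happens on each of N consecutive failures when the limit is
# RESTART_ATTEMPTS and nobody starts the resource manually in between.
# Usage: simulate_failures RESTART_ATTEMPTS N
simulate_failures() {
    limit=$1
    failures=$2
    count=0   # automatic restarts performed so far
    i=1
    while [ "$i" -le "$failures" ]; do
        if [ "$count" -lt "$limit" ]; then
            # A successful automatic restart still uses up one attempt.
            count=$((count + 1))
            echo "failure $i: restarted (restart count now $count)"
        else
            echo "failure $i: not restarted (NOMORE_RESTART_ATTEMPTS)"
        fi
        i=$((i + 1))
    done
}
```

Calling simulate_failures 2 3 reports two restarts followed by a refusal on the third failure, which is the pattern from our training exercise.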
