I thought this was interesting. I had always assumed that the restart counter of a cluster resource, the one checked against its RESTART_ATTEMPTS parameter, was only incremented when the clusterware failed to restart the resource. However, while developing an internal training course for our team, I noticed that after a couple of induced failures, each followed by a successful restart by the clusterware, the clusterware would not restart the resource after a third induced failure. We found the following in the $GRID_HOME/log/$(hostname)/agent/crsd/oraagent_oracle/oraagent_oracle.log file…
2012-06-14 13:28:38.129: [ USRTHRD][3959420672] {0:9:6} ClusterSubscriber::SubscriberWorker::InternalClusterSubscriber::handleEventCBexecuting for reason 1
2012-06-14 13:28:38.129: [ USRTHRD][3959420672] {0:9:6} event type is CRS_NOT_RESTARTING
2012-06-14 13:28:38.129: [ USRTHRD][3959420672] {0:9:6} bodylen = 528
2012-06-14 13:28:38.129: [ USRTHRD][3959420672] {0:9:6} -----------BodyBlock----------
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} ACTION='1'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} CLS_TINT='{0:9:6}'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} CURRENT_STATE='OFFLINE'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} DATABASE_TYPE='RAC'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} DB_UNIQUE_NAME='express.home'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} INSTANCE_NAME='express1'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} NAME='ora.express.db'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} NUMBER_OF_ATTEMPTS='2'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} REASON='NOMORE_RESTART_ATTEMPTS'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} RESOURCE_CLASS='database'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} RESOURCE_INCARNATION_NUMBER='4'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} RESOURCE_LOCATION='expressdb1'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} SEQUENCE_NUMBER='300118'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} TARGET_STATE='ONLINE'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} TIMESTAMP='2012-06-14 13:28:38'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} TYPE='ora.database.type'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} USER='SYSTEM'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} Version='11.2.0.3.0'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} CLUSTER_NAME='expresscrs'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} DB_UNIQUE_NAME='express.home'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} ORACLE_CLUSTERWARE.SUBCOMPONENT='CRSD'
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} RESOURCE_CLASS='database'
Notice the following line…
2012-06-14 13:28:38.130: [ USRTHRD][3959420672] {0:9:6} REASON='NOMORE_RESTART_ATTEMPTS'
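If you would rather not scan the whole event block by eye, a small filter along these lines could pull out the interesting fields. This is only a sketch: the function name not_restarting_events is made up for illustration, and it assumes the agent log keeps the layout shown in the excerpt above (one KEY='value' attribute per line, as the last whitespace-separated field).

```shell
#!/bin/bash
# Hypothetical helper (not an Oracle tool): read an oraagent log on stdin and
# print one summary line per CRS_NOT_RESTARTING event.
# Assumption: attributes appear as KEY='value' in the last field of each line.
not_restarting_events() {
  awk -v q="'" '
    # Return the text between the single quotes of a KEY=<quote>value<quote> field.
    function val(field,   parts) { split(field, parts, q); return parts[2] }

    # A new event starts here; only track it if it is CRS_NOT_RESTARTING.
    /event type is/ {
      in_event = ($0 ~ /CRS_NOT_RESTARTING/)
      name = attempts = reason = ""
      next
    }

    # Anchored matches, so INSTANCE_NAME and CLUSTER_NAME are not mistaken for NAME.
    in_event && $NF ~ /^NAME=/               { name     = val($NF) }
    in_event && $NF ~ /^NUMBER_OF_ATTEMPTS=/ { attempts = val($NF) }
    in_event && $NF ~ /^REASON=/             { reason   = val($NF) }

    # Once all three fields are known, report the event and stop tracking it.
    in_event && name != "" && attempts != "" && reason != "" {
      printf "%s attempts=%s reason=%s\n", name, attempts, reason
      in_event = 0
    }
  '
}
```

It would be fed the agent log on stdin, for example by redirecting the oraagent_oracle.log file mentioned above into not_restarting_events. For the event shown here it should print the resource name together with attempts=2 and reason=NOMORE_RESTART_ATTEMPTS.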
After we raised RESTART_ATTEMPTS on the resource to 5, the clusterware successfully restarted it again (up to five times).
expressdb1:grid:+ASM1:/home/grid# crsctl modify resource ora.express.db -attr RESTART_ATTEMPTS=5
expressdb1:grid:+ASM1:/home/grid# crs_stat -p ora.express.db
NAME=ora.express.db
TYPE=ora.database.type
ACTION_SCRIPT=
ACTIVE_PLACEMENT=1
AUTO_START=restore
CHECK_INTERVAL=1
DESCRIPTION=Oracle Database resource
FAILOVER_DELAY=0
FAILURE_INTERVAL=60
FAILURE_THRESHOLD=1
GEN_START_OPTIONS@SERVERNAME(expressdb1)=open
GEN_START_OPTIONS@SERVERNAME(expressdb2)=open
GEN_USR_ORA_INST_NAME@SERVERNAME(expressdb1)=express1
GEN_USR_ORA_INST_NAME@SERVERNAME(expressdb2)=express2
HOSTING_MEMBERS=
PLACEMENT=restricted
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=60
START_TIMEOUT=600
STOP_TIMEOUT=600
UPTIME_THRESHOLD=1h
USR_ORA_INST_NAME@SERVERNAME(expressdb1)=express1
USR_ORA_INST_NAME@SERVERNAME(expressdb2)=express2
expressdb1:grid:+ASM1:/home/grid#
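Most of that profile has nothing to do with restarts. A tiny filter like the one below narrows it down to the attributes that looked relevant to me. Treat the choice of attributes as my assumption rather than a definitive list, and restart_policy as a made-up helper name; it relies only on the KEY=value layout shown in the output above.

```shell
#!/bin/bash
# Hypothetical helper: keep only the restart-related lines of a resource
# profile read from stdin (KEY=value format, as printed above).
# The attribute list is an assumption about what matters for restarts.
restart_policy() {
  grep -E '^(RESTART_ATTEMPTS|UPTIME_THRESHOLD|FAILURE_INTERVAL|FAILURE_THRESHOLD)='
}
```

Piping the crs_stat -p output for ora.express.db through restart_policy would leave just those four lines, which makes it easier to compare the settings before and after a crsctl modify.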
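So the RESTART_ATTEMPTS parameter apparently applies to every restart performed by the clusterware, successful or not, rather than only to failed restart attempts. That fits the NUMBER_OF_ATTEMPTS='2' in the event above: both automatic restarts had already been used by the time of the third failure. As soon as the resource is started manually, the restart counter is reset to 0.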
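To make that counting rule concrete, here is a toy model of what we observed. It is not how the clusterware is implemented, only a sketch of the behavior as I understand it: the function and variable names are invented, and the limit of 2 is an assumption taken from NUMBER_OF_ATTEMPTS='2' in the log.

```shell
#!/bin/bash
# Toy model of the observed behavior; NOT Oracle's implementation.
restart_attempts=2   # assumed limit, taken from NUMBER_OF_ATTEMPTS='2' in the log
restart_count=0      # automatic restarts used so far
last_action=""       # what the model did on the most recent call

# The resource has failed: restart it only while the limit allows.
on_failure() {
  if [ "$restart_count" -lt "$restart_attempts" ]; then
    # The counter goes up even though the restart itself succeeds.
    restart_count=$((restart_count + 1))
    last_action="restarted"
  else
    last_action="NOMORE_RESTART_ATTEMPTS"
  fi
  echo "$last_action (count=$restart_count)"
}

# An administrator starts the resource by hand: the counter starts over.
manual_start() {
  restart_count=0
  last_action="manual_start"
  echo "$last_action (count=$restart_count)"
}
```

Calling on_failure three times in a row reproduces our test: the first two calls restart the resource, and the third reports NOMORE_RESTART_ATTEMPTS. A manual_start in between puts the counter back to 0, so the next failure is restarted again.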