We were initially taught by oracle that RAC was a panacea. Just add nodes on the fly, and the workload of a database instance will be automatically distributed over all available serves in the cluster. However, as time wore on, users began to realize that while moving database blocks between servers over the “high speed interconnect” was much faster than getting them from disk, it was not free. The process of managing the blocks in memory and who “owned” them proved to be fairly expensive. As such, user groups began calling for segmenting “workloads” on given servers in the cluster. While this did not provide for the plug in play scalability originally marketed by oracle, it did provide some scalability, and definitely fault tolerance. In other words, if I had four servers in my cluster, and decided to run updates on server4, I could expect my workload to continue processing if server4 failed. This is because I was told to configure the service running on server4 to be “available” on servers 1, 2 and 3. This proved to be workable.
However, things like this always drive me crazy. I decided to dig into a methodology to determine the provable cost of transferring blocks between instances. Luckily, our ole’ friend AWR came to the rescue. There is a view called dba_hist_inst_cache_transfer while provides all this information. The query below looks at the number of read consistent blocks found to be “busy” during all windows contained in our AWR.
select a.begin_interval_time,
a.inst2 inst2_to_inst1,
a.inst3 inst3_to_inst1,
b.inst1 inst1_to_inst2,
b.inst3 inst3_to_inst2,
c.inst1 inst1_to_inst3,
c.inst2 inst2_to_inst3
from
(select *
from (select distinct begin_interval_time,
inst2,
inst3
from (select to_char(begin_interval_time,'YYYY_MM_DD HH24:MI') begin_interval_time,
dhss.instance_number,
cr_busy - lag(cr_busy, 1, 0) over (order by dhss.instance_number, dhss.snap_id) as value_delta
from dba_hist_snapshot dhs,
dba_hist_inst_cache_transfer dhss
where dhs.snap_id = dhss.snap_id
and dhs.instance_number = dhss.instance_number
and class = 'data block'
and instance = 1
order by 1) c
model
ignore nav
dimension by(c.instance_number i,
c.begin_interval_time begin_interval_time)
measures(c.value_delta value_delta,
0 inst2,
0 inst3)
rules(inst2[any,any] = value_delta[i = 2, cv(begin_interval_time)],
inst3[any,any] = value_delta[i = 3, cv(begin_interval_time)])
order by 1)
where (inst2>= 0 and inst3>= 0)) a,
(select *
from (select distinct begin_interval_time,
inst1,
inst3
from (select to_char(begin_interval_time,'YYYY_MM_DD HH24:MI') begin_interval_time,
dhss.instance_number,
cr_busy - lag(cr_busy, 1, 0) over (order by dhss.instance_number, dhss.snap_id) as value_delta
from dba_hist_snapshot dhs,
dba_hist_inst_cache_transfer dhss
where dhs.snap_id = dhss.snap_id
and dhs.instance_number = dhss.instance_number
and class = 'data block'
and instance = 2
order by 1) c
model
ignore nav
dimension by(c.instance_number i,
c.begin_interval_time begin_interval_time)
measures(c.value_delta value_delta,
0 inst1,
0 inst3)
rules(inst1[any,any] = value_delta[i = 1, cv(begin_interval_time)],
inst3[any,any] = value_delta[i = 3, cv(begin_interval_time)])
order by 1)
where (inst1>= 0 and inst3>= 0)) b,
(select *
from (select distinct begin_interval_time,
inst1,
inst2
from (select to_char(begin_interval_time,'YYYY_MM_DD HH24:MI') begin_interval_time,
dhss.instance_number,
cr_busy - lag(cr_busy, 1, 0) over (order by dhss.instance_number, dhss.snap_id) as value_delta
from dba_hist_snapshot dhs,
dba_hist_inst_cache_transfer dhss
where dhs.snap_id = dhss.snap_id
and dhs.instance_number = dhss.instance_number
and class = 'data block'
and instance = 3
order by 1) c
model
ignore nav
dimension by(c.instance_number i,
c.begin_interval_time begin_interval_time)
measures(c.value_delta value_delta,
0 inst1,
0 inst2)
rules(inst1[any,any] = value_delta[i = 1, cv(begin_interval_time)],
inst2[any,any] = value_delta[i = 2, cv(begin_interval_time)])
order by 1)
where (inst1>= 0 and inst2>= 0)) c
where a.begin_interval_time = b.begin_interval_time
and a.begin_interval_time = c.begin_interval_time
When I ran what is above, I found a couple of periods where CR_BUSY was higher than for others. I went to the dba_hist_sql_stat view for one of the periods using something similar to what is below.
select *
from (select dbms_lob.substr(sql_text,4000,1),
dhss.sql_id,
to_char(begin_interval_time,'YYYY_MM_DD HH24:MI'),
dhs.instance_number,
clwait_delta
from dba_hist_sqltext dhst,
dba_hist_snapshot dhs,
dba_hist_sqlstat dhss
where dhst.sql_id = dhss.sql_id
and dhss.snap_id = dhs.snap_id
and dhss.instance_number = dhs.instance_number
and dhs.begin_interval_time between to_date('2009042522','YYYYMMDDHH24')
and to_date('2009042601','YYYYMMDDHH24')
order by 5 desc)
where rownum <= 10
A statement bubbled to the top, where I could begin further investigation.