Analyzing cache transfers between instances in a RAC

We were initially taught by oracle that RAC was a panacea. Just add nodes on the fly, and the workload of a database instance will be automatically distributed over all available serves in the cluster. However, as time wore on, users began to realize that while moving database blocks between servers over the “high speed interconnect” was much faster than getting them from disk, it was not free. The process of managing the blocks in memory and who “owned” them proved to be fairly expensive. As such, user groups began calling for segmenting “workloads” on given servers in the cluster. While this did not provide for the plug in play scalability originally marketed by oracle, it did provide some scalability, and definitely fault tolerance. In other words, if I had four servers in my cluster, and decided to run updates on server4, I could expect my workload to continue processing if server4 failed. This is because I was told to configure the service running on server4 to be “available” on servers 1, 2 and 3. This proved to be workable.

However, things like this always drive me crazy. I decided to dig into a methodology to determine the provable cost of transferring blocks between instances. Luckily, our ole’ friend AWR came to the rescue. There is a view called dba_hist_inst_cache_transfer while provides all this information. The query below looks at the number of read consistent blocks found to be “busy” during all windows contained in our AWR.

select a.begin_interval_time,
       a.inst2 inst2_to_inst1,
       a.inst3 inst3_to_inst1,
       b.inst1 inst1_to_inst2,
       b.inst3 inst3_to_inst2,
       c.inst1 inst1_to_inst3,
       c.inst2 inst2_to_inst3
from 
(select * 
  from (select distinct begin_interval_time, 
                        inst2,
                        inst3
          from (select to_char(begin_interval_time,'YYYY_MM_DD HH24:MI') begin_interval_time,
                       dhss.instance_number,
                       cr_busy - lag(cr_busy, 1, 0) over (order by dhss.instance_number, dhss.snap_id) as value_delta
                  from dba_hist_snapshot dhs,
                       dba_hist_inst_cache_transfer dhss
                  where dhs.snap_id = dhss.snap_id
                    and dhs.instance_number = dhss.instance_number
                    and class = 'data block'
                    and instance = 1
                  order by 1) c
          model
            ignore nav
            dimension by(c.instance_number i,
                         c.begin_interval_time begin_interval_time)
            measures(c.value_delta value_delta, 
                     0 inst2, 
                     0 inst3)
            rules(inst2[any,any] = value_delta[i = 2, cv(begin_interval_time)],
                  inst3[any,any] = value_delta[i = 3, cv(begin_interval_time)])
          order by 1)
  where (inst2>= 0 and inst3>= 0)) a,
  (select * 
    from (select distinct begin_interval_time, 
                          inst1,
                          inst3
            from (select to_char(begin_interval_time,'YYYY_MM_DD HH24:MI') begin_interval_time,
                         dhss.instance_number,
                         cr_busy - lag(cr_busy, 1, 0) over (order by dhss.instance_number, dhss.snap_id) as value_delta
                    from dba_hist_snapshot dhs,
                         dba_hist_inst_cache_transfer dhss
                    where dhs.snap_id = dhss.snap_id
                      and dhs.instance_number = dhss.instance_number
                      and class = 'data block'
                      and instance = 2
                    order by 1) c
            model
              ignore nav
              dimension by(c.instance_number i,
                           c.begin_interval_time begin_interval_time)
              measures(c.value_delta value_delta, 
                       0 inst1, 
                       0 inst3)
              rules(inst1[any,any] = value_delta[i = 1, cv(begin_interval_time)],
                    inst3[any,any] = value_delta[i = 3, cv(begin_interval_time)])
            order by 1)
  where (inst1>= 0 and inst3>= 0)) b,
  (select * 
    from (select distinct begin_interval_time, 
                          inst1,
                          inst2
            from (select to_char(begin_interval_time,'YYYY_MM_DD HH24:MI') begin_interval_time,
                         dhss.instance_number,
                         cr_busy - lag(cr_busy, 1, 0) over (order by dhss.instance_number, dhss.snap_id) as value_delta
                    from dba_hist_snapshot dhs,
                         dba_hist_inst_cache_transfer dhss
                    where dhs.snap_id = dhss.snap_id
                      and dhs.instance_number = dhss.instance_number
                      and class = 'data block'
                      and instance = 3
                    order by 1) c
            model
              ignore nav
              dimension by(c.instance_number i,
                           c.begin_interval_time begin_interval_time)
              measures(c.value_delta value_delta, 
                       0 inst1, 
                       0 inst2)
              rules(inst1[any,any] = value_delta[i = 1, cv(begin_interval_time)],
                    inst2[any,any] = value_delta[i = 2, cv(begin_interval_time)])
            order by 1)
  where (inst1>= 0 and inst2>= 0)) c
  where a.begin_interval_time = b.begin_interval_time
    and a.begin_interval_time = c.begin_interval_time

When I ran what is above, I found a couple of periods where CR_BUSY was higher than for others. I went to the dba_hist_sql_stat view for one of the periods using something similar to what is below.

select * 
  from (select dbms_lob.substr(sql_text,4000,1), 
               dhss.sql_id,
               to_char(begin_interval_time,'YYYY_MM_DD HH24:MI'), 
               dhs.instance_number,
               clwait_delta  
          from dba_hist_sqltext dhst, 
               dba_hist_snapshot dhs, 
               dba_hist_sqlstat dhss 
          where dhst.sql_id = dhss.sql_id 
            and dhss.snap_id = dhs.snap_id 
            and dhss.instance_number = dhs.instance_number 
            and dhs.begin_interval_time between to_date('2009042522','YYYYMMDDHH24') 
                                            and to_date('2009042601','YYYYMMDDHH24')
  order by 5 desc)
  where rownum <= 10

A statement bubbled to the top, where I could begin further investigation.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.