Runtime connection load balancing issues in 11.2.0.2

We upgraded our largest and most important databases last month to 11.2.0.2 from 10.2.0.4. We had a few issues, most of which have been resolved. However, one that continued to plague us until yesterday is the apparent inability of any version of the Oracle JDBC drivers to utilize runtime connection load balancing. For those of you that may not know, Oracle provides three different types of load balancing configurations in a RAC.

server side load balancing

Every 30 seconds, the listeners on each server are posted by the pmon process of each instance with the recent load average on the server. When a physical connection request is received by the listener on server2, it can tell the session:

“Yes, I could help you, but right now I am buried. Please talk to the instance on server3, as he recently told me he is not busy at all.”

The listener on server2 provides the URL for the instance on server3, to which the user happily connects and does her work. This method of load balancing at connection time is implemented via the remote_listener parameter, which enables instances of the clustered database to share information such as this with each others listeners.

There are also two other methods, one of which occurs entirely on the client, while the other also occurs on the client using information provided from the server.

client side load balancing

  1. This can be implemented via the LOAD_BALANCE parameter in the URL. This is not recommended, as the client simply round robins requests to each server in the URL regardless of current server load conditions. However, it can work in a pinch
  2. RCLB (Runtime Connection Load Balancing) – This is the real meat and potatoes of Oracle load balancing, and has always worked flawlessly for us. However, it is now the one that appeared to no longer work in 11.2.0.2

The way RCLB works is that every 30 seconds, a percentage of connections to send to each instance of the clustered database is posted to an internal queue (Advanced Queuing). This percentage is determined based on recent performance statistics that the Oracle software considers, as well as the goal of the service in question (DBMS_SERVICE)

If configured to subscribe to ONS, java clients connect to the ONS port on each server in the cluster, each of which subscribe to this queue. As events with the percentage to direct to each instance are dequeued by ONS, they are in turn provided to the client. For example, a UniversalConnectionPool in 11.2.0.2 or OracleDataSource in 10.2.0.4 may have 50 physical connections it is maintaining in a HashMap in the JVM. As application sessions in that JVM request a connection, they are fetched from this HashMap by the connection pool manager and “lent” to the application session. To make the decision as to which currently unused connection to lend to the application session, the connection manager (ideally) considers the information it has been provided by the ONS software running on the server. As noted, in 10.2.0.4 this was flawless.

To ensure the database is doing its job, run the following query and you may see something similar in your queue…

SQL> select to_char(from_tz(enq_time,'GMT') at time zone sessiontimezone,'YYYY/MM/DD HH24:MI:SS'),t.user_data.payload from sys.sys$service_metrics_tab t where rownum = 1 order by t.enq_time;

TO_CHAR(FROM_TZ(ENQ
-------------------
USER_DATA.PAYLOAD
-------------------------------------------------------------------------------------------------------------------------------2011/09/25 22:26:09
VERSION=1.0 database=wcprod service=wcpfan { {instance=wcprod1 percent=49 flag=GOOD aff=TRUE}{instance=wcprod2 percent=51 flag=GOOD aff=TRUE} } timestamp=2011-09-25 22:26:09

SQL>

You can also grep PAYLOAD in the oraagent log under $GRID_INFRASTRUCTURE_HOME/log/$(hostname)/agent/crsd/oraagent_oracle.

If you see output in either source similar to what is above, move on to the ONS process, as the database is posting the load balancing information. Enable logging in ONS by adding “debugcomp=onssubscribe” without the quotes to your ons.config file under your Grid Infrastructure home, and then restart ONS.

When we did this and reconnected from the application, we found the following exceptions were being generated.

11/09/30 13:50:58 [ons-subscribe] Connection 172f,172.31.111.1,6200 Subscriber 99
        unmatched '(' (offset 1)
11/09/30 13:52:53 [ons-subscribe] Connection 1735,172.31.111.1,6200 Subscriber 99
        unmatched '(' (offset 1)
11/09/30 13:53:23 [ons-subscribe] Connection 1735,172.31.111.1,6200 Subscriber 99
        unmatched '(' (offset 1)

We thought this was the issue, but when we dug further, we found the problem actually existed elsewhere. We knew this because we ended up turning on a network sniffer on the client in another test environment with the same issue.

[root@howards ~]# tcpdump -iwlan0 tcp port 6200 -A -s1500 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on wlan0, link-type EN10MB (Ethernet), capture size 1500 bytes

After starting up our test with JDBC drivers, we first see a connection from our client (192.168.1.110) to our database (dell11gr1)

12:27:20.042182 IP 192.168.1.110.47865 > dell11gr1.6200: P 1:53(52) ack 1 win 46 
..^@..X.POST /connect HTTP/1.1
Version: 3
FormFactor: 

…and we then see a status posted back to our client by ONS…

12:27:20.047819 IP dell11gr1.6200 > 192.168.1.110.47865: P 1:265(264) ack 53 win 46 
..X...^@POST /status HTTP/1.1
Version: 4 
Content-Length: 0
Result: success
Message: 
hostName: dell11gr1
clusterId: dbInstance_dell11gr1_6200
clusterName: dbInstance_dell11gr1_6200
instanceId: dbInstance_dell11gr1_6200
instanceName: dbInstance_dell11gr1_6200

…and next, we see our client subscribe to our “online.home” service…

12:27:20.771187 IP 192.168.1.110.47865 > dell11gr1.6200: P 53:187(134) ack 265 win 54 
..a...X.POST /subscribe HTTP/1.1
Content-Length: 0
SubscriberID: 1
Subscription: %"eventType=database/event/servicemetrics/online.home"

…and the ONS on the server respond in kind by sending us a status message (with an apparently empty message body)…

12:27:20.771967 IP dell11gr1.6200 > 192.168.1.110.47865: P 265:367(102) ack 187 win 54 
..[x..a.POST /status HTTP/1.1
Version: 4 
Content-Length: 0
SubscriberID: 1
Result: success
Message: 

…our client then informs ONS it is subscribing to any events for the database, service, or host…

12:27:24.442606 IP 192.168.1.110.47865 > dell11gr1.6200: P 187:339(152) ack 367 win 54 
..op..[xPOST /subscribe HTTP/1.1
Content-Length: 0
SubscriberID: 2
Subscription: (%"eventType=database/event/service")|(%"eventType=database/event/host")

…with another apparently empty message body. This goes on for the duration of the test…

12:27:24.443410 IP dell11gr1.6200 > 192.168.1.110.47865: P 367:469(102) ack 339 win 62 
..i...opPOST /status HTTP/1.1
Version: 4 
Content-Length: 0
SubscriberID: 2
Result: success
Message: 

We then found bug 9363352, which is related to Fast Connection Failover. However, it indicated that if the service_name was not created with the domain name appended (not just concatenated at runtime), FCF would not work. Apparently the same thing is true for load balancing.

We ended up dropping the service and recreating it in 11.2.0.2 with the domain name appended. Runtime connection load balancing then worked.

We first show that connection load balancing does not work as-is…

wctestdb01txwe:oracle:wcprod1:/home/oracle>java -Djava.util.logging.config.file=ucp.props testLB 10
URL = jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(LOAD_BALANCE=ON)(FAILOVER=ON)(ADDRESS=(PROTOCOL=TCP)(HOST=wcdb01pxwe-vip.hidden.domain)(PORT=2484))(ADDRESS=(PROTOCOL=TCP)(HOST=wcdb02pxwe-vip.hidden.domain)(PORT=2484)))(CONNECT_DATA=(SERVICE_NAME=wcpfan.hidden.domain)))
nodes=wcdb01pxwe:6200,wcdb02pxwe:6200

Connection distribution:
{wcprod2=50, wcprod1=49}

Connection distribution:
{wcprod2=50, wcprod1=47}

Connection distribution:
{wcprod2=47, wcprod1=50}

Connection distribution:
{wcprod2=48, wcprod1=50}

Connection distribution:
{wcprod2=49, wcprod1=50}

Connection distribution:
{wcprod2=49, wcprod1=50}

Connection distribution:
{wcprod2=50, wcprod1=49}

Connection distribution:
{wcprod2=50, wcprod1=46}

Connection distribution:
{wcprod2=50, wcprod1=46}

Connection distribution:
{wcprod2=50, wcprod1=49}

Runtime Connection Load Balancing Statistics
-----------------------------------------------------
Failed RCLB connection gets:            500
Successful RCLB connection gets:        0

…and we then show the service does not have the domain name appended to it in the OCR…

10:53:32 oracle@wcdb01pxwe ~ >srvctl config service -d wcprod -s wcpfan
Service name: wcpfan
Service is enabled
Server pool: wcprod_wcpfan
Cardinality: 2
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: true
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: SHORT
Runtime Load Balancing Goal: SERVICE_TIME
TAF policy specification: NONE
Edition:
Preferred instances: wcprod1,wcprod2
Available instances:

…so we then stop, remove, and re-add the service with the domain name…

10:53:50 oracle@wcdb01pxwe ~ >srvctl stop service -d wcprod -s wcpfan

10:53:59 oracle@wcdb01pxwe ~ >srvctl remove service -d wcprod -s wcpfan
10:54:06 oracle@wcdb01pxwe ~ >srvctl add service -d wcprod -s wcpfan.hidden.domain -r wcprod1,wcprod2 -q TRUE -B SERVICE_TIME -j SHORT
10:54:51 oracle@wcdb01pxwe ~ >srvctl start service -d wcprod -s wcpfan.hidden.domain

…and see we now use the connection time load balancing algorithms…

wctestdb01txwe:oracle:wcprod1:/home/oracle>java -Djava.util.logging.config.file=ucp.props testLB 10
URL = jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(LOAD_BALANCE=ON)(FAILOVER=ON)(ADDRESS=(PROTOCOL=TCP)(HOST=wcdb01pxwe-vip.hidden.domain)(PORT=2484))(ADDRESS=(PROTOCOL=TCP)(HOST=wcdb02pxwe-vip.hidden.domain)(PORT=2484)))(CONNECT_DATA=(SERVICE_NAME=wcpfan.hidden.domain)))
nodes=wcdb01pxwe:6200,wcdb02pxwe:6200

Connection distribution:
{wcprod2=50, wcprod1=49}

Connection distribution:
{wcprod2=50, wcprod1=48}

Connection distribution:
{wcprod2=49, wcprod1=50}

Connection distribution:
{wcprod2=49, wcprod1=50}

Connection distribution:
{wcprod2=50, wcprod1=49}

Connection distribution:
{wcprod2=50, wcprod1=45}

Connection distribution:
{wcprod2=50, wcprod1=46}

Connection distribution:
{wcprod2=50, wcprod1=47}

Connection distribution:
{wcprod2=50, wcprod1=48}

Connection distribution:
{wcprod2=50, wcprod1=41}

Runtime Connection Load Balancing Statistics
-----------------------------------------------------
Failed RCLB connection gets:            94
Successful RCLB connection gets:        406
wctestdb01txwe:oracle:wcprod1:/home/oracle>

We still had 94 connections out of 500 that didn’t use RCLB, but we consistently had zero before the change above, so we’ll dig into that as a separate issue. The JDBC drivers in 10.2.0.4 may have had the same issue (not every single connection using runtime load balancing), but we wouldn’t have known as those statistics are just now being exposed with the 11.2.0.2 UCP drivers.

This was a lot of grief for us, and Oracle Support was zero help, so hopefully this will help some poor lost soul who may wander by.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.