Friday, September 26, 2014

Prevent Firewall from Closing Idle Connections between App Server and Database Server (2)

In my previous post, I explained how to use TCP keepalive to prevent a firewall from closing idle connections between an app server and a database server. Here I explain another mechanism that accomplishes the same goal: Oracle DCD (Dead Connection Detection). This would in fact be the preferred mechanism because it's simpler to set up, as long as you can confirm that DCD packets are actually sent and that the firewall recognizes them as valid traffic. As Oracle note 257650.1 states, "some later firewalls and updated firewall firmware may not see DCD packets as a valid traffic possibly because the packets that DCD sends are actually empty packets."

In one of our environments, knowing the firewall has a 30-minute timeout, we configured DCD probes to be sent every 25 minutes by setting SQLNET.EXPIRE_TIME=25 in sqlnet.ora on the database server. However, we still saw the TNS timeout errors in the database alert logs that I described in my previous post. This led me to think that either DCD was not working or our firewall didn't recognize the probes as valid traffic, as Oracle note 257650.1 warns. So I ran strace on an Oracle server process to confirm that DCD packets were indeed sent (Oracle note 438923.1), and had our security guy confirm that the firewall recognized the DCD packets as valid traffic and passed them through. So why were we still seeing TNS timeouts in the database alert logs?

Then I found Oracle note 395505.1, which describes how DCD is triggered and states "the first DCD probe packet would go only after 2 * expire_time and successive one's would be sent every expiry_time provided no activity in that next span too". I ran strace again and confirmed this behavior. That explained the TNS timeouts: with SQLNET.EXPIRE_TIME=25, the first DCD packet goes out only after 50 minutes, so if a client connects to the database server and stays idle for 30 minutes, the firewall closes the connection before any DCD packet has been sent. To make sure the first DCD packet is sent within the first 30 minutes of the connection, SQLNET.EXPIRE_TIME has to be set to 30/2 - 1 = 14. To prove this, I set SQLNET.EXPIRE_TIME=14 and the TNS timeout errors disappeared from the alert logs; I then set SQLNET.EXPIRE_TIME=16 (first probe after 32 minutes) and the TNS timeout errors came back.

In conclusion, if your firewall policy has an idle timeout of x minutes, you need to set SQLNET.EXPIRE_TIME to x/2 - 1 or smaller.
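
The arithmetic behind this rule can be sketched in a few lines of Python. This is only an illustration of the x/2 - 1 rule and the "first probe after 2 * expire_time" behavior described above; the function names are mine, not anything from Oracle, and I assume whole-minute values since SQLNET.EXPIRE_TIME is given in minutes.

```python
def max_expire_time(firewall_timeout_min: int) -> int:
    """Largest SQLNET.EXPIRE_TIME (minutes) allowed by the x/2 - 1 rule."""
    value = firewall_timeout_min // 2 - 1
    if value < 1:
        # A value of 0 would turn DCD off, so the rule has no usable answer here.
        raise ValueError("firewall timeout is too short for the x/2 - 1 rule")
    return value


def first_probe_minute(expire_time_min: int) -> int:
    """Idle minutes before the first DCD probe goes out (2 * expire_time)."""
    return 2 * expire_time_min


def survives_firewall(expire_time_min: int, firewall_timeout_min: int) -> bool:
    """True if the first DCD probe is sent before the firewall idle timeout."""
    return first_probe_minute(expire_time_min) < firewall_timeout_min


if __name__ == "__main__":
    timeout = 30  # firewall idle timeout in minutes, as in this post
    print("max SQLNET.EXPIRE_TIME:", max_expire_time(timeout))
    for expire_time in (14, 16, 25):
        print(expire_time, survives_firewall(expire_time, timeout))
```

For the 30-minute firewall in this post, the values 14, 16, and 25 come out the same way the alert logs showed: only 14 gets a probe out before the firewall gives up on the connection.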


Prevent Firewall from Closing Idle Connections between App Server and Database Server (1)

If you see TNS timeout errors like the ones below in your database alert logs, the connection between your app server and database server was lost.

TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out
    nt secondary err code: 110
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=64656))

In my case, the connection was idle and was therefore closed by the firewall policy between my app server and database server. One way to prevent this is to use TCP keepalive on the Linux app server to keep the connection active. The steps below show how to set this up:


1. At the OS level, three parameters are involved: tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes. The one we need to deal with is the first. tcp_keepalive_time must be smaller than the timeout of the firewall policy that opened the ports between the app server and the database server. In our case the firewall timeout was the default of 30 minutes, so we want to set tcp_keepalive_time to less than 30 minutes, say 25 minutes (1500 seconds). As root, edit /etc/sysctl.conf and add the line below to the end:
net.ipv4.tcp_keepalive_time = 1500
then run "sysctl -p" to apply it immediately; the entry in /etc/sysctl.conf makes the setting persist across reboots.

2. As the oracle user, edit the database connection string in the tnsnames.ora file and add (ENABLE=BROKEN) to the DESCRIPTION section, for example:
DBServiceName =
  (DESCRIPTION =
    (ENABLE=BROKEN)
    (ADDRESS = (PROTOCOL = TCP)(HOST = YourDBServerIP)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME =  DBServiceName)
    )
  )

3. Bounce the app server.

4. Run "netstat -ntop | grep ESTAB" and make sure the app server processes are now running with TCP keepalive enabled. The command should return lines like the one below:

tcp        0      0 10.100.10.230:53802         10.100.10.234:1521          ESTABLISHED 5887/PSAPPSRV       keepalive (395.84/0/0)

If the processes are running without TCP keepalive enabled, the lines look like this instead:

tcp        0      0 10.100.10.230:22208         10.100.10.233:1521          ESTABLISHED 3874/PSAPPSRV       off (0.00/0/0)

Once this was done, the TNS timeout errors in the database logs disappeared.
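
If you have many connections to check, the test in step 4 can be scripted. Below is a minimal Python sketch that reads the timer column at the end of a "netstat -ntop" line. The function name and the regular expression are my own, and they assume the timer column always looks like the two sample lines above: a timer name followed by three slash-separated numbers in parentheses.

```python
import re

# Matches the trailing timer column, e.g. "keepalive (395.84/0/0)" or "off (0.00/0/0)".
TIMER_RE = re.compile(r"(\S+)\s+\(([\d.]+)/(\d+)/(\d+)\)\s*$")


def keepalive_status(netstat_line: str):
    """Return (keepalive_enabled, first_timer_value), or None if no timer column is found."""
    match = TIMER_RE.search(netstat_line)
    if match is None:
        return None
    timer_name = match.group(1)
    first_value = float(match.group(2))  # first number inside the parentheses
    return timer_name == "keepalive", first_value


if __name__ == "__main__":
    # The two sample lines from step 4.
    samples = [
        "tcp        0      0 10.100.10.230:53802         10.100.10.234:1521          ESTABLISHED 5887/PSAPPSRV       keepalive (395.84/0/0)",
        "tcp        0      0 10.100.10.230:22208         10.100.10.233:1521          ESTABLISHED 3874/PSAPPSRV       off (0.00/0/0)",
    ]
    for line in samples:
        print(keepalive_status(line))
```

Any connection to port 1521 that reports "off" in the timer column was opened without keepalive, which usually means the process was started before the tnsnames.ora change and still needs to be bounced.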