Friday, September 26, 2014

Prevent Firewall from Closing Idle Connections between App Server and Database Server (1)

If you see TNS timeout errors in your database alert logs like below, you lost the connection between your app server and database server. 

TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out
    nt secondary err code: 110
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=64656))

In my case, it was because the connection was idle and thus got closed out by firewall policy between my app server and database server. One way to prevent this from happening is to use TCP keepalive on my Linux app server to keep the connection active. Below shows how to implement this:


1. On OS level, 3 parameters are involved: tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes. What we need to deal with is the first parameter. We need to make tcp_keepalive_time smaller than the timeout value of firewall policy that opened the ports between app server and database server. By default the firewall timeout was 30 minutes, so we want to set  tcp_keepalive_time to less than 30 minutes, say 25 minutes. As root, edit /etc/sysctl.conf and add below line to the end:
net.ipv4.tcp_keepalive_time = 1500
then run “sysctl –p” to make it effective immediately and permanently.

2. Edit database connection string in tnsnames.ora file as oracle, add (ENABLE=BROKEN) in the description section, for example: DBServiceName =
  (DESCRIPTION =
    (ENABLE=BROKEN)
    (ADDRESS = (PROTOCOL = TCP)(HOST = YourDBServerIP)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME =  DBServiceName)
    )
  )

3. Bounce app server.

4     4. Run "netstat –ntop |grep ESTAB", make sure app server processes are now running with TCP Keep Alive enabled, The command should return things like below:

tcp        0      0 10.100.10.230:53802         10.100.10.234:1521          ESTABLISHED 5887/PSAPPSRV       keepalive (395.84/0/0)

If the processes are running without TCP Keep Alive enabled, we should see things like below:

tcp        0      0 10.100.10.230:22208         10.100.10.233:1521          ESTABLISHED 3874/PSAPPSRV       off (0.00/0/0)

Once this was done, the TNS timeout errors in the database logs disappeared. 

No comments: