In my previous post, I explained how to use TCP keepalive to prevent a firewall from closing idle connections between an app server and a database server. Here I am going to explain another mechanism that accomplishes the same goal: Oracle DCD (Dead Connection Detection). This would in fact be the preferred mechanism because it is simpler to set up, as long as you can confirm that DCD packets are indeed sent out and that the firewall recognizes them as valid traffic. As Oracle note 257650.1 states, "some later firewalls and updated firewall firmware may not see DCD packets as a valid traffic possibly because the packets that DCD sends are actually empty packets."
In one of our environments, knowing the firewall has a 30-minute timeout, we configured DCD packets to be sent every 25 minutes by setting SQLNET.EXPIRE_TIME=25. However, we still saw the TNS timeout errors in the database alert logs described in my previous post. This led me to think that either DCD was not working, or our firewall didn't recognize it as valid traffic, as Oracle note 257650.1 warned. So I went ahead and strace-ed the Oracle server process to confirm that DCD packets were indeed sent out (Oracle note 438923.1), and had our security guy confirm that the DCD packets were recognized as valid traffic and passed through by the firewall. So why were we still seeing TNS timeouts in the database alert logs?
Then I found Oracle note 395505.1, which describes how DCD is triggered and states that "the first DCD probe packet would go only after 2 * expire_time and successive one's would be sent every expiry_time provided no activity in that next span too". I strace-ed again and confirmed this statement. Now I knew why we were still seeing TNS timeouts: with SQLNET.EXPIRE_TIME=25 the first DCD packet is not sent until 50 minutes after the connection goes idle, so if a client connects to the database server and stays idle for 30 minutes, the firewall closes the connection before any DCD packet has been sent. To make sure the first DCD packet is sent within the first 30 minutes of the connection, SQLNET.EXPIRE_TIME has to be set to 30/2-1=14. To prove this, I set SQLNET.EXPIRE_TIME=14 and the TNS timeout errors disappeared from the alert logs; then I set SQLNET.EXPIRE_TIME=16 and the TNS timeout errors came back.
In conclusion, if you have a firewall policy with a timeout of x minutes, you need to set SQLNET.EXPIRE_TIME to x/2 - 1 or smaller.
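The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb, assuming whole minutes and the "first probe after 2 * expire_time" behaviour quoted from note 395505.1; the function names are my own and are not part of any Oracle tool.

```python
def max_expire_time(firewall_timeout_min: int) -> int:
    """Largest SQLNET.EXPIRE_TIME (minutes) allowed by the x/2 - 1 rule."""
    value = firewall_timeout_min // 2 - 1
    if value < 1:
        # A value of 0 would disable DCD, so the rule gives no usable answer.
        raise ValueError("firewall timeout too short for the x/2 - 1 rule")
    return value


def first_probe_minute(expire_time_min: int) -> int:
    """Idle minutes before the first DCD probe (2 * expire_time)."""
    return 2 * expire_time_min


def survives_firewall(expire_time_min: int, firewall_timeout_min: int) -> bool:
    """True if the first DCD probe goes out before the firewall times out."""
    return first_probe_minute(expire_time_min) < firewall_timeout_min


# The three settings tried in this post, against a 30-minute firewall timeout.
for setting in (25, 16, 14):
    print(setting, first_probe_minute(setting), survives_firewall(setting, 30))
```

Only the setting of 14 gets its first probe out (at minute 28) before the 30-minute firewall timeout, which matches what I saw in the alert logs.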
Friday, September 26, 2014
Prevent Firewall from Closing Idle Connections between App Server and Database Server (1)
If you see TNS timeout errors like the ones below in your database alert logs, you have lost the connection between your app server and database server.
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=64656))
In my case, it was because the connection was idle and thus got closed by the firewall policy between my app server and database server. One way to prevent this from happening is to use TCP keepalive on my Linux app server to keep the connection active. Here is how to implement it:
1. At the OS level, three parameters are involved: tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes. The one we need to deal with is the first. We need to make tcp_keepalive_time smaller than the timeout value of the firewall policy that opened the ports between the app server and database server. By default the firewall timeout was 30 minutes, so we want to set tcp_keepalive_time to less than 30 minutes, say 25 minutes (1500 seconds). As root, edit /etc/sysctl.conf and add the line below to the end:
net.ipv4.tcp_keepalive_time = 1500
then run "sysctl -p" to apply it immediately; the entry in /etc/sysctl.conf keeps it in effect across reboots.
2. As the oracle user, edit the database connection string in the tnsnames.ora file and add (ENABLE=BROKEN) to the DESCRIPTION section, for example:
DBServiceName =
  (DESCRIPTION =
    (ENABLE=BROKEN)
    (ADDRESS = (PROTOCOL = TCP)(HOST = YourDBServerIP)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = DBServiceName)
    )
  )
3. Bounce the app server.
4. Run "netstat -ntop | grep ESTAB" and make sure the app server processes are now running with TCP keepalive enabled. The command should return lines like the one below:
tcp 0 0 10.100.10.230:53802 10.100.10.234:1521 ESTABLISHED 5887/PSAPPSRV keepalive (395.84/0/0)
If the processes are running without TCP keepalive enabled, you will see lines like the one below:
tcp 0 0 10.100.10.230:22208 10.100.10.233:1521 ESTABLISHED 3874/PSAPPSRV off (0.00/0/0)
Once this was done, the TNS timeout errors in the database logs disappeared.
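If there are many app server processes, the netstat check can be scripted. Below is a minimal Python sketch that parses "netstat -ntop" output and lists established connections to the database port whose timer column reads "off". The field layout is assumed from the sample lines in this post, and the function and class names are hypothetical.

```python
from typing import List, NamedTuple


class Conn(NamedTuple):
    local: str
    remote: str
    program: str
    timer: str  # netstat timer column, e.g. "keepalive" or "off"


def parse_netstat(output: str, db_port: int = 1521) -> List[Conn]:
    """Extract ESTABLISHED TCP connections to db_port from netstat -ntop output."""
    conns = []
    for line in output.splitlines():
        fields = line.split()
        # Assumed columns: proto recv-q send-q local remote state pid/program timer (...)
        if len(fields) < 8 or fields[0] not in ("tcp", "tcp6"):
            continue
        if fields[5] != "ESTABLISHED" or not fields[4].endswith(f":{db_port}"):
            continue
        conns.append(Conn(fields[3], fields[4], fields[6], fields[7]))
    return conns


def keepalive_off(conns: List[Conn]) -> List[Conn]:
    """Connections whose timer column is 'off', i.e. no keepalive timer running."""
    return [c for c in conns if c.timer == "off"]


# The two sample lines from this post.
SAMPLE_OUTPUT = """\
tcp 0 0 10.100.10.230:53802 10.100.10.234:1521 ESTABLISHED 5887/PSAPPSRV keepalive (395.84/0/0)
tcp 0 0 10.100.10.230:22208 10.100.10.233:1521 ESTABLISHED 3874/PSAPPSRV off (0.00/0/0)
"""

for conn in keepalive_off(parse_netstat(SAMPLE_OUTPUT)):
    print("no keepalive:", conn.program, conn.local, "->", conn.remote)
```

On a real server the text would come from running netstat (for example through subprocess) instead of the sample string.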
Monday, June 16, 2014
ORA-12514 while tnsping works okay
I have a 2-node 12cR1 RAC cluster that has been running great. Last Friday we replaced our Juniper firewall with a Palo Alto firewall, and then I rebooted the RAC nodes. When I tried to start my application server, it failed to start. I then noticed I couldn't connect to my database, getting ORA-12514, even though tnsping worked fine. Initially we thought the firewall change might have broken it, because everything was working great before the firewall change and server reboot. However, since tnsping worked fine, nothing was blocking the application server from reaching the database through the defined port. Puzzling. Then I noticed the remote_listener parameter was empty, which led me to look at the sqlnet.ora file, where I found that EZCONNECT was no longer configured in NAMES.DIRECTORY_PATH. At this point, I remembered that a few weeks earlier I had copied a set of sqlnet.ora and ldap.ora files from a different server so that I could use LDAP instead of the tnsnames.ora file as the repository for database services. Apparently this new sqlnet.ora file didn't have EZCONNECT configured in it. It didn't break anything until the RAC nodes were rebooted, which coincided with the firewall change.
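A quick way to catch this after copying a sqlnet.ora from another server is to check whether EZCONNECT is still listed. The Python sketch below assumes the parameter is written in the usual parenthesised form; the function name and the sample file contents are hypothetical, and it does not evaluate Oracle's default when the parameter is absent.

```python
import re
from typing import Optional


def ezconnect_listed(sqlnet_text: str) -> Optional[bool]:
    """True/False if NAMES.DIRECTORY_PATH is set; None if it is not set at all."""
    # Drop '#' comments, then search the remaining text as one string.
    text = " ".join(line.split("#", 1)[0] for line in sqlnet_text.splitlines())
    match = re.search(r"NAMES\.DIRECTORY_PATH\s*=\s*\(([^)]*)\)", text, re.IGNORECASE)
    if match is None:
        return None  # parameter absent: Oracle's default applies, not checked here
    methods = [m.strip().upper() for m in match.group(1).split(",") if m.strip()]
    return "EZCONNECT" in methods


# Hypothetical file contents, before and after adding EZCONNECT back.
copied_file = "NAMES.DIRECTORY_PATH= (LDAP, TNSNAMES)\n"
fixed_file = "NAMES.DIRECTORY_PATH= (LDAP, TNSNAMES, EZCONNECT)\n"

print(ezconnect_listed(copied_file), ezconnect_listed(fixed_file))
```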
Thursday, March 13, 2014
Oracle Unbreakable Enterprise Kernel and PeopleSoft
I have been building a PeopleSoft HCM 9.2 demo environment on PeopleTools 8.53.11 this past week. Since Oracle Linux is certified with PeopleSoft as well, I chose to use it for my OS. Everything went smoothly, as I have done this kind of build many times. With the app server and web server started, I was eager to get to the sign-on page and try to log in. However, the sign-on page took a very long time (more than 30 minutes) to show up, and it came with an error message: "CHECK APPSERVER LOGS. THE SITE BOOTED WITH INTERNAL DEFAULT SETTINGS, BECAUSE of: bea.jolt.ServiceException:bea.jolt.JoltRemoteService(.GETALL)call():Timeout\nbea.jolt.SessionException:Connection recv error\nbea.jolt.JoltException:[1]NwHddlr.recv():Timeout Error". I looked through all the app server logs and didn't find anything out of the ordinary. Google showed no hits. Another strange thing was that the app server wouldn't shut down, even right after being brought up; it always hung at the second process:
Shutting down server processes ...
Server Id = 250 Group Id = JREPGRP Machine = xxx.xxx.xxx: shutdown succeeded
Server Id = 200 Group Id = JSLGRP Machine = xxx.xxx.xxx:
I opened a case with Oracle support, which was not much help. I compared all the settings with our other working environments and found not much difference. It really bothered me that it wasn't working. As a last effort, I switched back to booting with the Red Hat Compatible Kernel, and the issue disappeared! The problem kernel in my case was 3.8.13-26.2.1.el6uek.x86_64. I would never have imagined that the Unbreakable Enterprise Kernel would cause this issue, as I had just built an Oracle 12c RAC with this kernel and my HCM 9.2 demo database has been running on this 12c RAC very well.
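To tell quickly which kernel a server is booted with, the release string (the same one "uname -r" prints) can be checked for the "uek" tag. This is a small Python sketch that assumes UEK release strings always contain "uek", as the one above does; the function name is my own.

```python
import platform


def is_uek(release: str) -> bool:
    """True if a kernel release string looks like an Unbreakable Enterprise Kernel."""
    return "uek" in release.lower()


# platform.release() returns the running kernel's release string.
running = platform.release()
print(running, "UEK" if is_uek(running) else "not UEK")
```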
Wednesday, July 24, 2013
scsi_id returns nothing on OEL6 running on VMware
Today, while I was trying to set up UDEV for an Oracle 12cR1 RAC environment, I found that scsi_id returned nothing. After googling around, I found the solution below:
1. Shut down the VM.
2. Right-click the VM, then left-click 'Edit Settings'.
3. Click the 'Options' tab.
4. Click 'General', then 'Configuration Parameters'.
5. Click 'Add Row'.
6. Add a parameter 'disk.EnableUUID', set it to 'True', and click 'OK'.
7. Boot up the server; scsi_id -gud /dev/sdx now returns values.
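The setting from step 6 can also be checked without the GUI. The Python sketch below assumes the parameter ends up in the VM's .vmx file as a key = "value" line, which I have not verified for every VMware version; the function names and sample contents are hypothetical.

```python
import re
from typing import Optional


def enable_uuid_value(vmx_text: str) -> Optional[str]:
    """Return the raw disk.EnableUUID value from .vmx text, or None if absent."""
    pattern = re.compile(r'\s*disk\.EnableUUID\s*=\s*"?([^"\s]*)"?\s*$', re.IGNORECASE)
    for line in vmx_text.splitlines():
        match = pattern.match(line)
        if match:
            return match.group(1)
    return None


def uuid_enabled(vmx_text: str) -> bool:
    """True if disk.EnableUUID is present and set to true (case-insensitive)."""
    value = enable_uuid_value(vmx_text)
    return value is not None and value.upper() == "TRUE"


# Hypothetical .vmx fragment after step 6.
sample_vmx = 'scsi0.present = "TRUE"\ndisk.EnableUUID = "TRUE"\n'
print(uuid_enabled(sample_vmx))
```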
Tuesday, April 2, 2013
Virus Scan of Documents Uploaded to PeopleSoft
Lately I was assigned a task to configure the PeopleSoft web server with McAfee VirusScan Enterprise for Linux (which we already owned). It turned out that McAfee VSEL doesn't support ICAP, so it won't work with PeopleSoft. I was then able to download a trial version of "Symantec Protection Engine for Cloud Services" and get it to work with the WebLogic web server. The sample VirusScan.xml file that PeopleSoft provides with the PIA installation worked like a charm in a Tools 8.52 environment.
Friday, January 25, 2013
PeopleTools 8.52 and libpsio_dir.so
I recently upgraded the PeopleTools of our SA and HRMS environments from 8.50.20 to 8.52.10. The upgrade itself was pretty straightforward and went smoothly. However, it broke LDAP authentication. When you go to "PeopleTools -> Security -> Directory -> Configure Directory", click the "search" button, click the Directory ID, and then click the "Test Connectivity" tab, it just hangs there. Under the "Directory Setup" tab, all settings were confirmed correct. After some research, it turned out the problem was PS_HOME/bin/libpsio_dir.so. It appears PeopleTools 8.52 now delivers libpsio_dir.so under PS_HOME/bin/interfacedrivers. If the old 8.50.20 PS_HOME/bin/libpsio_dir.so still exists, it gets called instead of the new 8.52.10 PS_HOME/bin/interfacedrivers/libpsio_dir.so, thus causing the "hang" issue. So remember to remove the old libpsio_dir.so after the upgrade is done.
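A post-upgrade check for the leftover file could look like the Python sketch below. It only reports the old copy and leaves the removal to you; the function names are hypothetical, and the demo builds a throwaway directory tree rather than touching a real PS_HOME.

```python
import tempfile
from pathlib import Path
from typing import Optional


def stale_libpsio(ps_home: str) -> Optional[Path]:
    """Return the old bin/libpsio_dir.so if the new interfacedrivers copy also exists."""
    bin_dir = Path(ps_home) / "bin"
    old_lib = bin_dir / "libpsio_dir.so"
    new_lib = bin_dir / "interfacedrivers" / "libpsio_dir.so"
    if old_lib.is_file() and new_lib.is_file():
        return old_lib
    return None


def demo() -> bool:
    """Build a fake PS_HOME and confirm the leftover file is detected."""
    with tempfile.TemporaryDirectory() as ps_home:
        drivers = Path(ps_home) / "bin" / "interfacedrivers"
        drivers.mkdir(parents=True)
        (drivers / "libpsio_dir.so").touch()
        clean = stale_libpsio(ps_home) is None      # only the new library exists
        (Path(ps_home) / "bin" / "libpsio_dir.so").touch()
        found = stale_libpsio(ps_home) is not None  # old copy left behind
        return clean and found


print(demo())
```

Against a real installation you would call stale_libpsio with your PS_HOME path and remove the file it reports.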