Friday, April 22, 2016
Decommissioning a Data Node from an Existing Hadoop Cluster
Existing Hadoop cluster: hadoop1-6; decommissioning data node hadoop5.
1. Edit hdfs-site.xml on hadoop1 to include:
<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/hadoop/etc/hadoop/dfs.exclude</value>
</property>
2. Edit /home/hadoop/hadoop/etc/hadoop/dfs.exclude to add a line:
hadoop5
3. Run: distribute-exclude.sh dfs.exclude
4. Run: refresh-namenodes.sh
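The NameNode will now re-replicate hadoop5's blocks to the remaining nodes. As a quick check (standard HDFS tooling, not part of the original steps), watch the node's Decommission Status in the report change to "Decommissioned":
hdfs dfsadmin -report | grep -A 6 hadoop5
Once it shows "Decommissioned", the DataNode process on hadoop5 can be stopped safely.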
Thursday, April 21, 2016
Adding Data Node to an Existing Hadoop Cluster
Existing Hadoop cluster: hadoop1-5; adding a new data node: hadoop6.
1. Clone an existing data node VM to hadoop6.
2. Edit the /etc/hosts file to include hadoop6's IP address and hostname, then copy it to the remaining nodes in the cluster.
3. Edit the slaves file to include the hadoop6 hostname, then copy it to the remaining nodes in the cluster.
4. Delete HADOOP_DATA_DIR on hadoop6 (the cloned VM carries the original node's data and storage ID; removing the data directory lets hadoop6 register as a fresh DataNode).
5. Start data node on hadoop6.
hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
6. Balance the cluster:
hdfs balancer
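By default the balancer considers the cluster balanced once every DataNode's utilization is within 10 percentage points of the cluster average. To pack the new node more evenly, a tighter threshold can be passed (a standard balancer option):
hdfs balancer -threshold 5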
Monday, September 14, 2015
OPatch failed with error code = 41 on Windows 2008 R2
I ran into this error while installing patch 17 for 11.2.0.3 on Windows 2008 R2. The recommended action says OPatch needs to modify files which are being used by some processes. Even though I had stopped all of the Oracle-related services, it still failed with the same error. Here is how to resolve the issue:
1. Reboot the server into safe mode.
2. Rename the ORACLE_HOME directory to something else.
3. Reboot the server into normal mode.
4. Rename the ORACLE_HOME directory back to its original name.
5. Apply the patch as usual.
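If a reboot is hard to schedule, it may be worth first finding out which processes still hold files under the Oracle home. A sketch using Sysinternals handle.exe (not part of the original fix; the path below is just an example):
C:\> handle.exe -accepteula C:\oracle\product\11.2.0.3
Any process it lists has an open handle under the home and must be stopped before OPatch can modify those files; if nothing can be stopped, fall back to the safe-mode rename trick above.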
Tuesday, August 25, 2015
Set up TDE in 12c RAC
1. In 12c RAC, the keystore can live either in ASM or on an ACFS filesystem. We are using ASM in this setup:
Edit the RDBMS home's sqlnet.ora to add the lines below:
ENCRYPTION_WALLET_LOCATION=
  (SOURCE=
    (METHOD=FILE)
    (METHOD_DATA=
      (DIRECTORY=+PSDATA/WALLET/$ORACLE_UNQNAME/)))
2. Set ORACLE_UNQNAME in .bash_profile:
In .bash_profile, set ORACLE_UNQNAME after ORACLE_SID is set (${ORACLE_SID%?} below strips the trailing instance number from the SID so it matches the database unique name):
export ORACLE_UNQNAME=`$ORACLE_HOME/bin/srvctl config database |grep -w ${ORACLE_SID%?}`
3. Set ORACLE_UNQNAME in CRS:
Set ORACLE_UNQNAME in CRS, otherwise v$encryption_wallet and gv$encryption_wallet show different information.
srvctl setenv database -d TDEDEMO -t "ORACLE_UNQNAME=TDEDEMO"
4. Create the keystore:
SQL> administer key management create keystore '+PSDATA/WALLET/TDEDEMO/' identified by "password1";
keystore altered.
5. Open the keystore:
SQL> administer key management set keystore open identified by "password1";
keystore altered.
6. Create the master key:
SQL> administer key management create key identified by "password1" with backup using 'TDEDEMO';
keystore altered.
7. Activate the master key:
SQL> select key_id from v$encryption_keys;
KEY_ID
------------------------------------------------------------------------------
AazYtFb200+Nv7T4i/i5e4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
SQL> administer key management use key 'AazYtFb200+Nv7T4i/i5e4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' identified by "password1" with backup using 'TDEDEMO';
keystore altered.
8. Back up the keystore. (You can't back up from ASM to a non-ASM location, or vice versa; it fails with ORA-46620: backup for the keystore cannot be taken.)
SQL> administer key management backup keystore using 'TDEDEMO' identified by "password1" to '/psoft/backup/wallets/';
administer key management backup keystore using 'TDEDEMO' identified by "password1" to '/psoft/backup/wallets/'
*
ERROR at line 1:
ORA-46620: backup for the keystore cannot be taken
SQL> administer key management backup keystore using 'TDEDEMO' identified by "password1" to '+PSFLASH/WALLET/TDEDEMO/';
keystore altered.
9. Create an auto_login keystore:
SQL> administer key management create auto_login keystore from keystore '+PSDATA/WALLET/TDEDEMO/' identified by "password1";
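To confirm the keystore state on both instances, and to exercise the new master key, a hedged example using the standard dictionary views and 12c tablespace encryption syntax (the tablespace name and size are made up):
SQL> select inst_id, wrl_parameter, status, wallet_type from gv$encryption_wallet;
SQL> create tablespace enc_demo datafile '+PSDATA' size 100m
     encryption using 'AES256' default storage (encrypt);
With ORACLE_UNQNAME set consistently (steps 2 and 3 above), both instances should report the same WRL_PARAMETER with STATUS = OPEN.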
Wednesday, November 26, 2014
Change 12cR1 Public IP/VIP/SCAN
Old IPs:
SCAN: 10.100.10.235, 10.100.10.236, 10.100.10.237
Hosts: 10.100.10.231, 10.100.10.232
VIPs: 10.100.10.233, 10.100.10.234
New IPs:
SCAN: 10.100.101.125, 10.100.101.126, 10.100.101.127
Hosts: 10.100.101.121, 10.100.101.122
VIPs: 10.100.101.123, 10.100.101.124
Step 1: As grid: (on first node only)
oifcfg delif -global eth0/10.100.10.0
oifcfg setif -global eth0/10.100.101.0:public
Step 2: As grid:
[grid@psdb1 ~]$ srvctl config nodeapps -a :verify the current config
Network 1 exists
Subnet IPv4: 10.100.10.0/255.255.255.0/eth0, static
Subnet IPv6:
VIP exists: network number 1, hosting node psdb1
VIP Name: psdb1-vip.lab.hsc.net.ou.edu
VIP IPv4 Address: 10.100.10.233
VIP IPv6 Address:
VIP exists: network number 1, hosting node psdb2
VIP Name: psdb2-vip.lab.hsc.net.ou.edu
VIP IPv4 Address: 10.100.10.234
VIP IPv6 Address:
srvctl stop vip -n psdb1 -f
srvctl relocate scan -scannumber 1 -node psdb2 (relocate scan to psdb2)
srvctl relocate scan -scannumber 2 -node psdb2 (relocate scan to psdb2)
srvctl relocate scan -scannumber 3 -node psdb2 (relocate scan to psdb2)
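Before touching the network config on psdb1, confirm the relocation took (a quick check, not in the original notes):
srvctl status scan : all three SCAN VIPs should now be running on psdb2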
Step 3: as root:
vi /etc/sysconfig/network-scripts/ifcfg-eth0, change IP/GATEWAY
Go to VM console:
ifdown eth0
ifup eth0
Change NIC to the new VLAN
ssh to the new IP
change hostname/hostname-vip in DNS
srvctl modify network -k 1 -S 10.100.101.0/255.255.255.0/eth0 (on first node only)
srvctl modify nodeapps -n psdb1 -A psdb1-vip/255.255.255.0/eth0
srvctl config nodeapps -a : verify changes
Step 4: as grid:
srvctl start vip -n psdb1 (databases still up and servicing from psdb2)
Step 5: as root: (on first node)
srvctl stop scan_listener
srvctl stop scan
srvctl status scan :verify scan is off
srvctl status scan_listener: verify scan_listener is off
change SCAN IPs in DNS
srvctl config scan: shows old SCAN IPs
srvctl modify scan -n psdb-scan
Step 6: Repeat Steps 2 to 4 for the remaining nodes in the cluster
Step 7: as root:
srvctl start scan
srvctl start scan_listener
Step 8: Rolling bounce all databases
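Once everything is back up, a final sanity check (standard commands; psdb-scan is the SCAN name used above):
nslookup psdb-scan : should return the three new SCAN IPs
srvctl config scan : should now show the new SCAN IPs
srvctl status scan_listener : all SCAN listeners should be running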
Friday, September 26, 2014
Prevent Firewall from Closing Idle Connections between App Server and Database Server (2)
In my previous blog, I explained how to use TCP keepalive to prevent a firewall from closing idle connections between an app server and a database server. Here I will explain another mechanism that accomplishes the same goal: Oracle DCD (Dead Connection Detection). This would in fact be the preferred mechanism because it is simpler to set up, as long as you can confirm that DCD packets are actually sent out and that the firewall recognizes them as valid traffic. As Oracle note 257650.1 states, "some later firewalls and updated firewall firmware may not see DCD packets as a valid traffic possibly because the packets that DCD sends are actually empty packets."
In one of our environments, knowing the firewall had a 30-minute timeout, we configured DCD probes to go out every 25 minutes by setting SQLNET.EXPIRE_TIME=25. However, we still saw the TNS timeout errors described in my previous blog in the database alert logs. This led me to think that either DCD was not working, or our firewall did not recognize the probes as valid traffic, as Oracle note 257650.1 warns. So I straced an Oracle server process to confirm that DCD packets were indeed sent out (Oracle note 438923.1), and had our security guy confirm that the firewall recognized the DCD packets as valid traffic and passed them through. So why were we still seeing TNS timeouts in the database alert logs?
Then I found Oracle note 395505.1, which describes how DCD is triggered: "the first DCD probe packet would go only after 2 * expire_time and successive one's would be sent every expiry_time provided no activity in that next span too". I straced again and confirmed this statement. Now I knew why we were still seeing TNS timeouts: if a client connects to the database server and stays idle for 30 minutes, the firewall closes the connection because no DCD packet has been sent yet. To make sure the first DCD packet is sent within the first 30 minutes of a connection, SQLNET.EXPIRE_TIME has to be set to 30/2 - 1 = 14. To prove this, I set SQLNET.EXPIRE_TIME=14 and the TNS timeout errors disappeared from the alert logs; I then set SQLNET.EXPIRE_TIME=16 and the errors reoccurred.
In conclusion, if your firewall policy has a timeout of x minutes, you need to set SQLNET.EXPIRE_TIME to x/2 - 1 or smaller.
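For the 30-minute firewall policy above, the database-side sqlnet.ora entry ends up as a single line (a sketch; the value 14 follows from the 30/2 - 1 calculation):
# sqlnet.ora on the database server
# first probe goes out after 2 * 14 = 28 minutes, inside the 30-minute window
SQLNET.EXPIRE_TIME = 14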
Prevent Firewall from Closing Idle Connections between App Server and Database Server (1)
If you see TNS timeout errors like the ones below in your database alert logs, connections between your app server and database server are being dropped.
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=64656))
In my case, it was because idle connections were being closed by the firewall policy between my app server and database server. One way to prevent this is to use TCP keepalive on the Linux app server to keep the connections active. Here is how to implement it:
1. On the OS level, three parameters are involved: tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes. The one we need to deal with is the first: tcp_keepalive_time must be smaller than the timeout of the firewall policy that opened the ports between the app server and the database server. Our firewall timeout was 30 minutes, so we want tcp_keepalive_time under 30 minutes, say 25 minutes (1500 seconds). As root, edit /etc/sysctl.conf and add the line below at the end:
net.ipv4.tcp_keepalive_time = 1500
Then run "sysctl -p" to apply it immediately; the entry in /etc/sysctl.conf makes it permanent across reboots.
2. As oracle, edit the database connection string in the tnsnames.ora file to add (ENABLE=BROKEN) in the description section, for example:
DBServiceName =
  (DESCRIPTION =
    (ENABLE=BROKEN)
    (ADDRESS = (PROTOCOL = TCP)(HOST = YourDBServerIP)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = DBServiceName)
    )
  )
3. Bounce the app server.
4. Run "netstat -ntop |grep ESTAB" to make sure the app server processes are now running with TCP keepalive enabled. The command should return lines like this:
tcp 0 0 10.100.10.230:53802 10.100.10.234:1521 ESTABLISHED 5887/PSAPPSRV keepalive (395.84/0/0)
A process running without TCP keepalive shows up like this instead:
tcp 0 0 10.100.10.230:22208 10.100.10.233:1521 ESTABLISHED 3874/PSAPPSRV off (0.00/0/0)
Once this was done, the TNS timeout errors in the database logs disappeared.
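The other two kernel parameters were left at their defaults (typically 75-second intervals and 9 probes on Linux). To double-check the effective values, a quick sysctl query works:
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes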