Monday, July 6, 2009

Troubleshooting & Important CRSCTL commands

Note#
Any command that only queries cluster information can be run as the oracle user, whereas any command that changes the cluster configuration must be run as the root user.
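As a hypothetical illustration of this rule (the `crs_run` helper and its verb list below are my own sketch, not part of crsctl), a small wrapper could route configuration verbs through root while leaving queries to the current user. It prints the command it would run rather than executing it:

```shell
#!/bin/sh
# Hypothetical helper illustrating the query-vs-configuration rule above.
# Query verbs (check, query, status) can run as the oracle user; anything
# that changes the cluster configuration is escalated through sudo/root.
crs_run() {
    verb="$1"
    case "$verb" in
        check|query|status)
            echo "crsctl $*"          # query: run as the current (oracle) user
            ;;
        *)
            echo "sudo crsctl $*"     # configuration change: run as root
            ;;
    esac
}

crs_run check crs
crs_run stop crs
```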

Start CRS
==========
crsctl start crs (as root)
init.crs start (as root)
crs_start -all (as oracle)

Stop CRS
==========
crsctl stop crs (as root)
crs_stop -all (as oracle)
init.crs stop (as root)

Check CRS
==========
crsctl check crs

Enable Oracle Clusterware
===================
crsctl enable crs

Disable Oracle Clusterware
====================
crsctl disable crs

Check location of Voting Disk
=====================
crsctl query css votedisk
0 /oracrs/oradata/data01/vdisk1
0 /oracrs/oradata/data01/vdisk2
0 /oracrs/oradata/data01/vdisk3

located 3 votedisk(s).
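The "located 3 votedisk(s)." summary above can be reproduced by counting the disk lines in the query output. The sketch below uses the sample output shown above as stand-in input; on a live cluster you would pipe the real `crsctl query css votedisk` output instead:

```shell
#!/bin/sh
# Count the voting disks reported by 'crsctl query css votedisk'.
# The sample output from the document stands in for the live command here.
votedisk_output='0 /oracrs/oradata/data01/vdisk1
0 /oracrs/oradata/data01/vdisk2
0 /oracrs/oradata/data01/vdisk3'

# Each voting disk line begins with its numeric index.
count=$(printf '%s\n' "$votedisk_output" | grep -c '^[0-9]')
echo "located $count votedisk(s)."
```

An odd number of voting disks (here three) is used so that a strict majority of them can still be reached if one becomes inaccessible.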

Check location of OCR Disk
=====================
$CRS_ORACLE_HOME/bin/ocrcheck

Status of Oracle Cluster Registry is as follows :
Version : 2
Total space (kbytes) : 262120
Used space (kbytes) : 2024
Available space (kbytes) : 260096
ID : 2110452402
Device/File Name : /oracrs/oradata/data01/ocrdisk1
Device/File integrity check succeeded
Device/File Name : /oracrs/oradata/data02/ocrdisk2
Device/File integrity check succeeded
Cluster registry integrity check succeeded
======================================
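The space figures in the ocrcheck output can be monitored automatically. The sketch below parses the sample output shown above (a live check would pipe `$CRS_ORACLE_HOME/bin/ocrcheck` instead); the 10240 KB threshold is an arbitrary assumption for illustration:

```shell
#!/bin/sh
# Extract the available OCR space from ocrcheck-style output and warn when
# it drops below a threshold. Sample output from the document is used here.
ocr_output='Total space (kbytes) : 262120
Used space (kbytes) : 2024
Available space (kbytes) : 260096'

# Take the number after the colon on the "Available space" line.
avail=$(printf '%s\n' "$ocr_output" | awk -F: '/Available space/ {gsub(/ /,"",$2); print $2}')

if [ "$avail" -lt 10240 ]; then          # threshold chosen for illustration
    echo "WARNING: only ${avail}KB free in OCR"
else
    echo "OCR space OK: ${avail}KB available"
fi
```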

All the clusterware processes are normally retrieved via OS commands like:
ps -ef | grep -E 'init|d.bin|ocls|sleep|evmlogger|oprocd|diskmon|PID'

There are general processes, i.e. processes that are started on all platforms/releases,
and specific processes, i.e. processes that are started only on some CRS versions/platforms.
A)
The general processes are
ocssd.bin
evmd.bin
evmlogger.bin
crsd.bin
B)
The specific processes are

oprocd: run on Unix when vendor Clusterware is not running. On Linux, only starting with
10.2.0.4.
oclsvmon.bin: normally run when a third party clusterware is running
oclsomon.bin: check program of the ocssd.bin (starting in 10.2.0.1)
diskmon.bin: new 11.1.0.7 process for exadata
oclskd.bin: new 11.1.0.6 process to reboot nodes in case rdbms instances are hanging

There are three fatal processes, i.e. processes whose abnormal halt or kill will provoke a node
reboot (see note:265769.1):
1. the ocssd.bin
2. the oprocd.bin
3. the oclsomon.bin


The other processes are automatically restarted when they go away.
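Since killing any of the three fatal processes reboots the node, it can be useful to verify they are present without touching them. The sketch below checks a simulated process listing; on a live node you would substitute `ps -e -o comm=` for the hard-coded listing (in which `oclsomon.bin` is deliberately left out to show the warning path):

```shell
#!/bin/sh
# Check that the three fatal clusterware daemons appear in a ps listing.
# The listing is simulated here; live: ps_listing=$(ps -e -o comm=)
ps_listing='ocssd.bin
oprocd.bin
evmd.bin
crsd.bin'

missing=''
for proc in ocssd.bin oprocd.bin oclsomon.bin; do
    printf '%s\n' "$ps_listing" | grep -q "^$proc$" || missing="$missing $proc"
done

if [ -n "$missing" ]; then
    echo "fatal process(es) not running:$missing"
else
    echo "all fatal processes present"
fi
```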


When the clusterware is not allowed to start on boot
This state is reached when:
1. 'crsctl stop crs' has been issued and the clusterware is stopped
or
2. the automatic startup of the clusterware has been disabled and the node has been rebooted, e.g.
./init.crs disable
Automatic startup disabled for system boot.

The 'ps' command only shows the three inittab processes with spawned sleeping processes in a
30-second loop

ps -ef | grep -E 'init|d.bin|ocls|oprocd|diskmon|evmlogger|sleep|PID'

UID PID PPID C STIME TTY TIME CMD
root 1 0 0 16:55 ? 00:00:00 init [5]
root 19770 1 0 18:00 ? 00:00:00 /bin/sh /etc/init.d/init.evmd run
root 19854 1 0 18:00 ? 00:00:00 /bin/sh /etc/init.d/init.crsd run
root 19906 1 0 18:00 ? 00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root 22143 19770 0 18:02 ? 00:00:00 /bin/sleep 30
root 22255 19854 0 18:02 ? 00:00:00 /bin/sleep 30
root 22266 19906 0 18:02 ? 00:00:00 /bin/sleep 30

The clusterware can be re-enabled via './init.crs enable' and/or via 'crsctl start crs'


When the clusterware is allowed to start on boot, but can't start because some
prerequisites are not met


This state is reached when the node has rebooted and some prerequisites are missing, e.g.

1. OCR is not accessible
2. Cluster interconnect can't accept TCP connections
3. CRS_HOME is not mounted

How to Troubleshoot?
'crsctl check boot' (run as oracle) shows errors, e.g.
$ crsctl check boot
Oracle Cluster Registry initialization failed accessing Oracle Cluster Registry device:
PROC-26: Error while accessing the physical storage Operating System error [No such file or
directory] [2]

The three inittab processes sleep for 60 seconds in a loop in 'init.cssd startcheck'
ps -ef | grep -E 'init|d.bin|ocls|oprocd|diskmon|evmlogger|sleep|PID'
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 18:28 ? 00:00:00 init [5]
root 4969 1 0 18:29 ? 00:00:00 /bin/sh /etc/init.d/init.evmd run
root 5060 1 0 18:29 ? 00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root 5064 1 0 18:29 ? 00:00:00 /bin/sh /etc/init.d/init.crsd run
root 5405 4969 0 18:29 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root 5719 5060 0 18:29 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root 5819 5064 0 18:29 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root 6986 5405 0 18:30 ? 00:00:00 /bin/sleep 60
root 6987 5819 0 18:30 ? 00:00:00 /bin/sleep 60
root 7025 5719 0 18:30 ? 00:00:00 /bin/sleep 60

Once 'crsctl check boot' returns nothing (no more error messages), the clusterware processes will start.
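The 60-second startcheck loop above can be mimicked as a poll that waits until the check comes back clean. So that this sketch is self-contained, a stub function stands in for the real command; on a live node you would replace the `check_boot` body with an actual `crsctl check boot` call:

```shell
#!/bin/sh
# Poll until 'crsctl check boot' produces no output, mirroring the
# startcheck loop. The stub below fails twice, then comes back clean,
# to simulate a prerequisite (e.g. OCR access) being restored.
check_boot() {
    # stub standing in for: crsctl check boot
    [ "$1" -le 2 ] && echo "PROC-26: Error while accessing the physical storage"
}

attempts=0
while :; do
    attempts=$((attempts + 1))
    out=$(check_boot "$attempts")
    [ -z "$out" ] && break
    echo "prerequisites not met yet: $out"
    sleep 1   # the real startcheck loop sleeps 60 seconds
done
echo "check boot clean after $attempts poll(s)"
```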

Note#
CRS is designed to run at runlevel 3 or 5 (GUI); verify with:
cat /etc/inittab