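Our site runs a RAC on Linux AS 5, using ASM on raw storage devices. It has four nodes, data1 through data4. Yesterday at 0:00 one of the data instances died; after that, data1 and data4 went down one after the other.
In the alert logs on data1 and data4 I can see messages about a dead instance being detected, the global resource directory being frozen, and node reconfiguration.
Now, after rebooting the machines, CRS does not start automatically, and running crsctl start crs and crsctl start resource manually on data1 and data4 gets no response. I checked the ocssd and crs logs: crs is waiting for ocssd, and ocssd has not come up. The ocssd.log files on data1 and data4 seem to indicate that the two machines are locking each other out.
The relevant information is pasted below.
1. Alert log on data4 at the time the db instance shut down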
Reconfiguration complete
Sat Apr 18 06:13:46 2009
Error: unexpected error (6) from the Cluster Service (LCK0)
Sat Apr 18 06:13:46 2009
Errors in file /opt/oracle/admin/orcl/bdump/orcl4_lck0_7823.trc:
ORA-29702: error occurred in Cluster Group Service operation
LCK0: terminating instance due to error 29702
Sat Apr 18 06:13:46 2009
System state dump is made for local instance
System State dumped to trace file /opt/oracle/admin/orcl/bdump/orcl4_diag_7559.trc
Sat Apr 18 06:13:47 2009
Trace dumping is performing id=[cdmp_20090418061346]
Sat Apr 18 06:13:50 2009
Instance terminated by LCK0, pid = 7823
2. /opt/oracle/crs/log/data4/crsd/crsd.log
2009-04-18 19:32:42.373: [ default][46946624][ENTER]0
Oracle Database 10g CRS Release 10.2.0.1.0 Production Copyright 1996, 2004, Oracle. All rights reserved
2009-04-18 19:32:42.391: [ default][46946624]0CRS Daemon Starting
2009-04-18 19:32:42.392: [ CRSMAIN][46946624]0Checking the OCR device
2009-04-18 19:32:42.403: [ CRSMAIN][46946624]0Connecting to the CSS Daemon
2009-04-18 19:32:42.601: [ COMMCRS][63451776]clsc_connect: (0x60000000000e93f0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_data4_crs))
2009-04-18 19:32:42.602: [ CSSCLNT][46946624]clsssInitNative: connect failed, rc 9
2009-04-18 19:32:42.602: [ CRSRTI][46946624]0CSS is not ready. Received status 3 from CSS. Waiting for good status
3. ocssd.log on data4
[ CSSD]2009-04-18 19:32:43.321 >USER: Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2004 Oracle. All rights reserved.
[ CSSD]2009-04-18 19:32:43.321 >USER: CSS daemon log for node data4, number 4, in cluster crs
[ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=data4DBG_CSSD))
[ CSSD]2009-04-18 19:32:43.341 [48727072] >TRACE: clssscmain: local-only set to false
[ CSSD]2009-04-18 19:32:43.345 [48727072] >TRACE: clssnmReadNodeInfo: added node 1 (data1) to cluster
[ CSSD]2009-04-18 19:32:43.356 [48727072] >TRACE: clssnmReadNodeInfo: added node 2 (data2) to cluster
[ CSSD]2009-04-18 19:32:43.359 [48727072] >TRACE: clssnmReadNodeInfo: added node 3 (data3) to cluster
[ CSSD]2009-04-18 19:32:43.362 [48727072] >TRACE: clssnmReadNodeInfo: added node 4 (data4) to cluster
[ CSSD]2009-04-18 19:32:43.379 [94761600] >TRACE: clssnm_skgxnmon: skgxn init failed, rc 1
[ CSSD]2009-04-18 19:32:43.379 [48727072] >TRACE: clssnm_skgxnonline: Using vacuous skgxn monitor
[ CSSD]2009-04-18 19:32:43.380 [48727072] >TRACE: clssnmInitNMInfo: misscount set to 60
[ CSSD]2009-04-18 19:32:43.382 [48727072] >TRACE: clssnmDiskStateChange: state from 1 to 2 disk (0//dev/raw/raw2)
[ CSSD]2009-04-18 19:32:45.384 [94761600] >TRACE: clssnmDiskStateChange: state from 2 to 4 disk (0//dev/raw/raw2)
[ CSSD]2009-04-18 19:32:45.411 [94761600] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(24642) LATS(0) Disk lastSeqNo(24642)
[ CSSD]2009-04-18 19:32:45.411 [94761600] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(24) wrtcnt(312134) LATS(0) Disk lastSeqNo(312134)
[ CSSD]2009-04-18 19:32:45.411 [94761600] >TRACE: clssnmReadDskHeartbeat: node(3) is down. rcfg(27) wrtcnt(383485) LATS(0) Disk lastSeqNo(383485)
[ CSSD]2009-04-18 19:32:45.436 [48727072] >TRACE: clssnmFatalInit: fatal mode enabled
[ CSSD]2009-04-18 19:32:45.448 [116650624] >TRACE: clssnmconnect: connecting to node 4, flags 0x0001, connector 1
[ CSSD]2009-04-18 19:32:45.448 [116650624] >TRACE: clssnmconnect: connecting to node 4, flags 0x0001, connector 1
[ CSSD]2009-04-18 19:32:45.449 [116650624] >TRACE: clssnmconnect: connecting to node 0, flags 0x0000, connector 1
[ CSSD]2009-04-18 19:32:45.449 [116650624] >TRACE: clssnmconnect: connecting to node 1, flags 0x0001, connector 0
[ CSSD]2009-04-18 19:32:45.449 [116650624] >TRACE: clssnmconnect: connecting to node 2, flags 0x0001, connector 0
[ CSSD]2009-04-18 19:32:45.449 [116650624] >TRACE: clssnmconnect: connecting to node 3, flags 0x0001, connector 0
[ CSSD]2009-04-18 19:32:45.450 [116650624] >TRACE: clsc_send_msg: (0x60000000001c3fd0) NS err (12571, 12560), transport (530, 111, 0)
[ CSSD]2009-04-18 19:32:45.455 [128266880] >TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_crs_4))
[ CSSD]2009-04-18 19:32:45.455 [128266880] >TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_data4_crs))
[ CSSD]2009-04-18 19:32:47.435 [94761600] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(1) LATS(4194414596) Disk lastSeqNo(1)
[ CSSD]2009-04-18 19:32:49.440 [94761600] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(2) LATS(4194416602) Disk lastSeqNo(2)
[ CSSD]2009-04-18 19:32:49.468 [116650624] >TRACE: clssnmConnComplete: probe from node 1
[ CSSD]2009-04-18 19:32:49.468 [116650624] >TRACE: clssnmconnect: connecting to node 1, flags 0x0001, connector 1
[ CSSD]2009-04-18 19:32:49.469 [116650624] >TRACE: clssnmConnComplete: connected to node 1 (con 0x6000000000223ec0), state 1 birth 0, unique 1240054366/1240054366 prevConuni(0)
[ CSSD]2009-04-18 19:32:50.443 [94761600] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(3) LATS(4194417605) Disk lastSeqNo(3)
[ CSSD]2009-04-18 19:32:51.446 [94761600] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(4) LATS(4194418608) Disk lastSeqNo(4)
[ CSSD]2009-04-18 19:32:52.449 [94761600] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(5) LATS(4194419611) Disk lastSeqNo(5)
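To make the clssnmReadDskHeartbeat lines above easier to compare, here is a rough Python sketch that collects the wrtcnt values per node. It assumes the line format is exactly as in the excerpt; the function names summarize_disk_heartbeats and nodes_still_writing are my own and not part of any Oracle tool. Reading a changing wrtcnt as "that node is still writing its disk heartbeat" is only my interpretation of the log, not something I have confirmed.

```python
import re
from collections import defaultdict

# Matches the clssnmReadDskHeartbeat lines as they appear in the excerpt.
HEARTBEAT_RE = re.compile(
    r"clssnmReadDskHeartbeat: node\((\d+)\) is down\. "
    r"rcfg\((\d+)\) wrtcnt\((\d+)\) LATS\((\d+)\)"
)


def summarize_disk_heartbeats(lines):
    """Return {node_number: [wrtcnt values in log order]}."""
    writes = defaultdict(list)
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            node = int(match.group(1))
            writes[node].append(int(match.group(3)))
    return dict(writes)


def nodes_still_writing(summary):
    """Nodes whose wrtcnt differs between reads (assumed to mean the
    node is still updating its disk heartbeat)."""
    return sorted(n for n, counts in summary.items() if len(set(counts)) > 1)


# A few lines copied from the ocssd.log excerpt, used only as sample input.
SAMPLE = [
    "[ CSSD]2009-04-18 19:32:45.411 [94761600] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(24642) LATS(0) Disk lastSeqNo(24642)",
    "[ CSSD]2009-04-18 19:32:45.411 [94761600] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(24) wrtcnt(312134) LATS(0) Disk lastSeqNo(312134)",
    "[ CSSD]2009-04-18 19:32:45.411 [94761600] >TRACE: clssnmReadDskHeartbeat: node(3) is down. rcfg(27) wrtcnt(383485) LATS(0) Disk lastSeqNo(383485)",
    "[ CSSD]2009-04-18 19:32:45.436 [48727072] >TRACE: clssnmFatalInit: fatal mode enabled",
    "[ CSSD]2009-04-18 19:32:47.435 [94761600] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(1) LATS(4194414596) Disk lastSeqNo(1)",
    "[ CSSD]2009-04-18 19:32:49.440 [94761600] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(2) LATS(4194416602) Disk lastSeqNo(2)",
]

if __name__ == "__main__":
    summary = summarize_disk_heartbeats(SAMPLE)
    for node, counts in sorted(summary.items()):
        print(f"node {node}: wrtcnt values {counts}")
    print("nodes whose wrtcnt changes:", nodes_still_writing(summary))
```

On the full ocssd.log the same functions can be fed the lines of the opened file instead of SAMPLE.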
I don't understand why ocssd will not come up, or where to start troubleshooting. Any advice from the experts would be appreciated.
