    Title: [Discussion] What is wrong with ocssd on RAC?
    lingbo_ty
    Posted 2009-4-18 21:45

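    We run a RAC at my site on Linux AS 5, using ASM on raw storage devices. It has four nodes, data1 through data4. At 0:00 yesterday a data instance died. After that, data1 and data4 went down one after another.
    The alert logs on data1 and data4 show messages about a dead instance being detected, the global resource directory being frozen, and node reconfiguration.
    Now, after rebooting the machines, CRS does not start automatically, and running crsctl start crs and crsctl start resource manually on data1 and data4 has no effect. From the ocssd and crs logs, crs is waiting for ocssd, but ocssd has not come up, and the ocssd.log files on data1 and data4 seem to suggest that the two machines are locking each other out.
    The relevant information is posted below.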

    1. Alert log on data4 at the time the db instance shut down
    Reconfiguration complete
    Sat Apr 18 06:13:46 2009
    Error: unexpected error (6) from the Cluster Service (LCK0)
    Sat Apr 18 06:13:46 2009
    Errors in file /opt/oracle/admin/orcl/bdump/orcl4_lck0_7823.trc:
    ORA-29702: error occurred in Cluster Group Service operation
    LCK0: terminating instance due to error 29702
    Sat Apr 18 06:13:46 2009
    System state dump is made for local instance
    System State dumped to trace file /opt/oracle/admin/orcl/bdump/orcl4_diag_7559.trc
    Sat Apr 18 06:13:47 2009
    Trace dumping is performing id=[cdmp_20090418061346]
    Sat Apr 18 06:13:50 2009
    Instance terminated by LCK0, pid = 7823


    2.  /opt/oracle/crs/log/data4/crsd/crsd.log
    2009-04-18 19:32:42.373: [ default][46946624][ENTER]0
    Oracle Database 10g CRS Release 10.2.0.1.0 Production Copyright 1996, 2004, Oracle.  All rights reserved
    2009-04-18 19:32:42.391: [ default][46946624]0CRS Daemon Starting
    2009-04-18 19:32:42.392: [ CRSMAIN][46946624]0Checking the OCR device
    2009-04-18 19:32:42.403: [ CRSMAIN][46946624]0Connecting to the CSS Daemon
    2009-04-18 19:32:42.601: [ COMMCRS][63451776]clsc_connect: (0x60000000000e93f0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_data4_crs))

    2009-04-18 19:32:42.602: [ CSSCLNT][46946624]clsssInitNative: connect failed, rc 9

    2009-04-18 19:32:42.602: [  CRSRTI][46946624]0CSS is not ready. Received status 3 from CSS. Waiting for good status

    3. CSS daemon log (ocssd.log) on data4
    [    CSSD]2009-04-18 19:32:43.321 >USER:    Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2004 Oracle.  All rights reserved.
    [    CSSD]2009-04-18 19:32:43.321 >USER:    CSS daemon log for node data4, number 4, in cluster crs
    [  clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=data4DBG_CSSD))
    [    CSSD]2009-04-18 19:32:43.341 [48727072] >TRACE:   clssscmain: local-only set to false
    [    CSSD]2009-04-18 19:32:43.345 [48727072] >TRACE:   clssnmReadNodeInfo: added node 1 (data1) to cluster
    [    CSSD]2009-04-18 19:32:43.356 [48727072] >TRACE:   clssnmReadNodeInfo: added node 2 (data2) to cluster
    [    CSSD]2009-04-18 19:32:43.359 [48727072] >TRACE:   clssnmReadNodeInfo: added node 3 (data3) to cluster
    [    CSSD]2009-04-18 19:32:43.362 [48727072] >TRACE:   clssnmReadNodeInfo: added node 4 (data4) to cluster
    [    CSSD]2009-04-18 19:32:43.379 [94761600] >TRACE:   clssnm_skgxnmon: skgxn init failed, rc 1
    [    CSSD]2009-04-18 19:32:43.379 [48727072] >TRACE:   clssnm_skgxnonline: Using vacuous skgxn monitor
    [    CSSD]2009-04-18 19:32:43.380 [48727072] >TRACE:   clssnmInitNMInfo: misscount set to 60
    [    CSSD]2009-04-18 19:32:43.382 [48727072] >TRACE:   clssnmDiskStateChange: state from 1 to 2 disk (0//dev/raw/raw2)
    [    CSSD]2009-04-18 19:32:45.384 [94761600] >TRACE:   clssnmDiskStateChange: state from 2 to 4 disk (0//dev/raw/raw2)
    [    CSSD]2009-04-18 19:32:45.411 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(24642) LATS(0) Disk lastSeqNo(24642)
    [    CSSD]2009-04-18 19:32:45.411 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(24) wrtcnt(312134) LATS(0) Disk lastSeqNo(312134)
    [    CSSD]2009-04-18 19:32:45.411 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(3) is down. rcfg(27) wrtcnt(383485) LATS(0) Disk lastSeqNo(383485)
    [    CSSD]2009-04-18 19:32:45.436 [48727072] >TRACE:   clssnmFatalInit: fatal mode enabled
    [    CSSD]2009-04-18 19:32:45.448 [116650624] >TRACE:   clssnmconnect: connecting to node 4, flags 0x0001, connector 1
    [    CSSD]2009-04-18 19:32:45.448 [116650624] >TRACE:   clssnmconnect: connecting to node 4, flags 0x0001, connector 1
    [    CSSD]2009-04-18 19:32:45.449 [116650624] >TRACE:   clssnmconnect: connecting to node 0, flags 0x0000, connector 1
    [    CSSD]2009-04-18 19:32:45.449 [116650624] >TRACE:   clssnmconnect: connecting to node 1, flags 0x0001, connector 0
    [    CSSD]2009-04-18 19:32:45.449 [116650624] >TRACE:   clssnmconnect: connecting to node 2, flags 0x0001, connector 0
    [    CSSD]2009-04-18 19:32:45.449 [116650624] >TRACE:   clssnmconnect: connecting to node 3, flags 0x0001, connector 0
    [    CSSD]2009-04-18 19:32:45.450 [116650624] >TRACE:   clsc_send_msg: (0x60000000001c3fd0) NS err (12571, 12560), transport (530, 111, 0)
    [    CSSD]2009-04-18 19:32:45.455 [128266880] >TRACE:   clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_crs_4))
    [    CSSD]2009-04-18 19:32:45.455 [128266880] >TRACE:   clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_data4_crs))
    [    CSSD]2009-04-18 19:32:47.435 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(1) LATS(4194414596) Disk lastSeqNo(1)
    [    CSSD]2009-04-18 19:32:49.440 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(2) LATS(4194416602) Disk lastSeqNo(2)
    [    CSSD]2009-04-18 19:32:49.468 [116650624] >TRACE:   clssnmConnComplete: probe from node 1
    [    CSSD]2009-04-18 19:32:49.468 [116650624] >TRACE:   clssnmconnect: connecting to node 1, flags 0x0001, connector 1
    [    CSSD]2009-04-18 19:32:49.469 [116650624] >TRACE:   clssnmConnComplete: connected to node 1 (con 0x6000000000223ec0), state 1 birth 0, unique 1240054366/1240054366  prevConuni(0)
    [    CSSD]2009-04-18 19:32:50.443 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(3) LATS(4194417605) Disk lastSeqNo(3)
    [    CSSD]2009-04-18 19:32:51.446 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(4) LATS(4194418608) Disk lastSeqNo(4)
    [    CSSD]2009-04-18 19:32:52.449 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(5) LATS(4194419611) Disk lastSeqNo(5)



    I don't understand why ocssd did not come up, or where to start troubleshooting. Any advice from the experts would be appreciated.
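One possible starting point for reading the CSSD trace above is to tabulate the clssnmReadDskHeartbeat lines per node and check whether each node's wrtcnt value is changing. The sketch below is only illustrative: the function names are made up for this example, and treating wrtcnt as a per-node disk heartbeat write counter is an assumption based on how the values behave in the trace, not something the thread confirms.

```python
import re

# Matches the disk-heartbeat trace lines quoted in the post, for example:
# "clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(5) LATS(...) ..."
HEARTBEAT_RE = re.compile(
    r"clssnmReadDskHeartbeat: node\((\d+)\) is down\. "
    r"rcfg\((\d+)\) wrtcnt\((\d+)\)"
)


def heartbeat_counts(lines):
    """Collect the wrtcnt values seen for each node, in log order."""
    counts = {}
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            node = int(match.group(1))
            counts.setdefault(node, []).append(int(match.group(3)))
    return counts


def still_writing(values):
    """True if the last two samples differ (assumed to mean the counter moves)."""
    return len(values) >= 2 and values[-1] != values[-2]


# A few lines copied from the ocssd.log excerpt in the post.
SAMPLE = [
    "[    CSSD]2009-04-18 19:32:45.411 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(24642) LATS(0) Disk lastSeqNo(24642)",
    "[    CSSD]2009-04-18 19:32:45.411 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(24) wrtcnt(312134) LATS(0) Disk lastSeqNo(312134)",
    "[    CSSD]2009-04-18 19:32:45.411 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(3) is down. rcfg(27) wrtcnt(383485) LATS(0) Disk lastSeqNo(383485)",
    "[    CSSD]2009-04-18 19:32:47.435 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(1) LATS(4194414596) Disk lastSeqNo(1)",
    "[    CSSD]2009-04-18 19:32:49.440 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(2) LATS(4194416602) Disk lastSeqNo(2)",
    "[    CSSD]2009-04-18 19:32:50.443 [94761600] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(1) wrtcnt(3) LATS(4194417605) Disk lastSeqNo(3)",
]

if __name__ == "__main__":
    for node, values in sorted(heartbeat_counts(SAMPLE).items()):
        status = "changing" if still_writing(values) else "static or single sample"
        print(f"node {node}: wrtcnt {values} ({status})")
```

On the sample lines, node 1 is the only node whose value keeps changing, while nodes 2 and 3 each appear once. Whether that points to the storage, the interconnect, or something else is not something this sketch can decide.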

         


    lingbo_ty
    Posted 2009-4-18 21:47
    Bumping my own thread. Here is the resource status on data3 (crs_stat -t format):


    Name           Type           Target    State     Host
    ------------------------------------------------------------
    ora....SM1.asm application    ONLINE    OFFLINE
    ora....A1.lsnr application    ONLINE    OFFLINE
    ora.data1.gsd  application    ONLINE    OFFLINE
    ora.data1.ons  application    ONLINE    OFFLINE
    ora.data1.vip  application    ONLINE    ONLINE    data3
    ora....SM2.asm application    ONLINE    OFFLINE
    ora....A2.lsnr application    ONLINE    OFFLINE
    ora.data2.gsd  application    ONLINE    OFFLINE
    ora.data2.ons  application    ONLINE    OFFLINE
    ora.data2.vip  application    ONLINE    OFFLINE
    ora....SM3.asm application    ONLINE    ONLINE    data3
    ora....A3.lsnr application    ONLINE    ONLINE    data3
    ora.data3.gsd  application    ONLINE    ONLINE    data3
    ora.data3.ons  application    ONLINE    ONLINE    data3
    ora.data3.vip  application    ONLINE    ONLINE    data3
    ora....SM4.asm application    ONLINE    OFFLINE
    ora....A4.lsnr application    ONLINE    OFFLINE
    ora.data4.gsd  application    ONLINE    OFFLINE
    ora.data4.ons  application    ONLINE    OFFLINE
    ora.data4.vip  application    ONLINE    ONLINE    data3
    ora.orcl.db    application    ONLINE    ONLINE    data3
    ora....l1.inst application    ONLINE    OFFLINE
    ora....l2.inst application    ONLINE    OFFLINE
    ora....l3.inst application    ONLINE    ONLINE    data3
    ora....l4.inst application    ONLINE    OFFLINE
    ora...._taf.cs application    ONLINE    ONLINE    data3
    ora....cl1.srv application    ONLINE    OFFLINE
    ora....cl2.srv application    ONLINE    OFFLINE
    ora....cl3.srv application    ONLINE    ONLINE    data3
    ora....cl4.srv application    ONLINE    OFFLINE
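A table like the one above is easier to scan once it is reduced to the resources that are not in their target state and the VIPs that are running away from their home node. The following is a minimal sketch, assuming the five-column "Name Type Target State Host" layout shown in the post; the function names are hypothetical and it relies on the ora.<node>.vip naming seen in the table.

```python
def parse_crs_stat(text):
    """Parse a 'Name Type Target State Host' table into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the dashed separator (one field), and the header.
        if len(fields) < 4 or fields[0] == "Name":
            continue
        name, rtype, target, state = fields[:4]
        host = fields[4] if len(fields) > 4 else None
        rows.append(
            {"name": name, "type": rtype, "target": target,
             "state": state, "host": host}
        )
    return rows


def not_at_target(rows):
    """Names of resources whose State differs from their Target."""
    return [r["name"] for r in rows if r["state"] != r["target"]]


def vips_relocated(rows):
    """VIP resources running on a host other than the one in their name."""
    moved = []
    for r in rows:
        if r["name"].endswith(".vip") and r["host"]:
            home = r["name"].split(".")[1]  # assumes names like ora.<node>.vip
            if home != r["host"]:
                moved.append((r["name"], r["host"]))
    return moved


# A few rows copied from the table in the post.
SAMPLE = """\
Name           Type           Target    State     Host
------------------------------------------------------------
ora....SM1.asm application    ONLINE    OFFLINE
ora.data1.vip  application    ONLINE    ONLINE    data3
ora.data3.vip  application    ONLINE    ONLINE    data3
ora.data4.vip  application    ONLINE    ONLINE    data3
ora....l4.inst application    ONLINE    OFFLINE
"""

if __name__ == "__main__":
    rows = parse_crs_stat(SAMPLE)
    print("not at target:", not_at_target(rows))
    print("relocated VIPs:", vips_relocated(rows))
```

Applied to the full table, this kind of summary shows the data1 and data4 VIPs hosted on data3, which matches the two nodes the original post describes as down.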


    Yong Huang (moderator)
    Posted 2009-4-19 10:15
    I'm not sure what caused it, but did you have a hardware issue around that time? Is there anything in /var/log/messages (excluding the "session opened" and "session closed" lines)? Was there a time change at 0:00?

    I suggest you open an SR with Oracle.

    Yong Huang
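The check suggested above (look at /var/log/messages around the time of the failure, ignoring session open/close lines) could be sketched roughly as follows. This is an illustration under assumptions: the helper names are invented here, the timestamp handling assumes the classic syslog "Mon DD HH:MM:SS" prefix, and the sample lines are in the format of the data1 excerpt posted in this thread.

```python
import re

# PAM session lines look like:
# "Apr 18 02:16:30 data1 su(pam_unix)[5160]: session closed for user oracle"
SESSION_RE = re.compile(r"session (opened|closed) for user")


def without_session_noise(lines):
    """Drop PAM session open/close lines and keep everything else."""
    return [line for line in lines if not SESSION_RE.search(line)]


def in_hour(lines, month, day, hour):
    """Keep lines whose syslog timestamp falls within the given hour."""
    # Classic syslog pads the day of month with a space to width 2.
    prefix = f"{month} {day:2d} {hour:02d}:"
    return [line for line in lines if line.startswith(prefix)]


SAMPLE = [
    "Apr 18 00:00:14 data1 kernel: oracle(6290): floating-point assist fault at ip 4000000007fd8922, isr 0000020000001001",
    "Apr 18 00:00:14 data1 last message repeated 3 times",
    "Apr 18 02:16:30 data1 shutdown: shutting down for system halt",
    "Apr 18 02:16:30 data1 init: Switching to runlevel: 0",
    "Apr 18 02:16:30 data1 su(pam_unix)[5160]: session closed for user oracle",
    "Apr 18 02:16:30 data1 su(pam_unix)[6447]: session closed for user oracle",
]

if __name__ == "__main__":
    # Show what was logged during the 0:00 hour, minus session noise.
    for line in in_hour(without_session_noise(SAMPLE), "Apr", 18, 0):
        print(line)
```

Reading a real file would only need the lines of /var/log/messages passed in place of SAMPLE, with trailing newlines stripped.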


    lingbo_ty
    Posted 2009-4-19 14:30
    Thanks! Here is what I found in /var/log/messages:


    Apr 18 00:00:14 data1 kernel: oracle(6290): floating-point assist fault at ip 4000000007fd8922, isr 0000020000001001
    Apr 18 00:00:14 data1 last message repeated 3 times
    Apr 18 01:01:01 data1 kernel: oracle(31468): floating-point assist fault at ip 4000000007fd8922, isr 0000020000001001
    Apr 18 01:01:01 data1 last message repeated 3 times
    Apr 18 02:16:30 data1 shutdown: shutting down for system halt
    Apr 18 02:16:30 data1 init: Switching to runlevel: 0
    Apr 18 02:16:30 data1 su(pam_unix)[5160]: session closed for user oracle
    Apr 18 02:16:30 data1 su(pam_unix)[6447]: session closed for user oracle
    Apr 18 02:16:32 data1 gpm[4603]: *** info [mice.c(1766)]:
    Apr 18 02:16:32 data1 gpm[4603]: imps2: Auto-detected intellimouse PS/2
    Apr 18 02:16:33 data1 cups-config-daemon: cups-config-daemon -TERM succeeded
    Apr 18 02:16:33 data1 haldaemon: haldaemon -TERM succeeded
    Apr 18 02:16:33 data1 messagebus: messagebus -TERM succeeded
    Apr 18 02:16:33 data1 atd: atd shutdown succeeded


    Also, when I ran fdisk -l, I found that some of the raw disks had disappeared from the nodes, and the information was inconsistent.


    lingbo_ty
    Posted 2009-4-19 17:36
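    Problem solved. The machine room temperature was too high, which caused the hardware to malfunction.

    After we lowered the temperature, we restarted the disk array and the RAC servers, and everything returned to normal.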


    Yong Huang (moderator)
    Posted 2009-4-20 00:26
    Thank you very much for posting your solution. That benefits everybody.

    Are you running Itanium, judging by the message "floating-point assist fault at ip"? We used to have horrible server crash incidents on those servers but never figured out the root cause. I strongly suspected high temperature, because Itanium servers are very sensitive to that. See
    http://yong321.freeshell.org/oranotes/ItaniumProblem.txt

    Yong Huang


    lingbo_ty
    Posted 2009-4-20 13:20
    Yes, our servers use Itanium CPUs. Thank you for your answer.


    shahand (moderator)
    Posted 2009-4-20 16:48
    clssnm_skgxnmon: skgxn init failed, rc 1.

