Node 2 just rebooted on its own again, for the 5th time now

inthirties2 · posted on 2010-3-16 08:58

tolywang's question threads always show me something I did not know about.

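lgydicanio · posted on 2010-3-17 10:31

The problem I ran into earlier was heavy swapping that exhausted CPU resources and caused repeated reboots; configuring HugePages solved it. In your case, though, the node seems to go down without using any swap.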

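tolywang · posted on 2010-3-17 20:46

Originally posted by lgydicanio on 2010-3-17 10:31
The problem I ran into earlier was heavy swapping that exhausted CPU resources and caused repeated reboots; configuring HugePages solved it. In your case, though, the node seems to go down without using any swap.

Yes, sometimes the machine goes down even when the load is not high.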

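tolywang · posted on 2010-3-28 09:06

Node 2 rebooted abnormally again today. The Linux log, CRS log, and Oracle log on node 2 contain nothing useful, only startup messages.
The Linux log and CRS log from node 1 are shown below. They show that when node 2 began going down, at the time of the "LIP reset occured" messages on node 1 (Mar 27 16:02:53),
the heartbeat seen by node 1 was not yet in danger; node 1 only started reporting heartbeat fatal at 2010-03-27 16:03:07.

This indicates that node 2 rebooted abnormally first, which then caused the heartbeat fatal messages on node 1. Finally node 1 found that node 2 could not be reached over the interconnect and
evicted node 2 from the cluster.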

RAC01 Linux Log :

Mar 24 03:32:38 hou249bbodb3111 ntpd[6028]: synchronized to 10.17.50.73, stratum 2
Mar 25 05:21:01 hou249bbodb3111 auditd[4540]: Audit daemon rotating log files
Mar 26 08:41:01 hou249bbodb3111 auditd[4540]: Audit daemon rotating log files
Mar 27 12:03:01 hou249bbodb3111 auditd[4540]: Audit daemon rotating log files
Mar 27 16:02:53 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.1: LIP reset occured (f7f7).
Mar 27 16:02:53 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.0: LIP reset occured (f7f7).
Mar 27 16:02:53 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.1: LIP occured (f7f7).
Mar 27 16:02:53 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.0: LIP occured (f7f7).
Mar 27 16:03:03 hou249bbodb3111 openais[5499]: [TOTEM] The token was lost in the OPERATIONAL state.
Mar 27 16:03:03 hou249bbodb3111 openais[5499]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Mar 27 16:03:03 hou249bbodb3111 openais[5499]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Mar 27 16:03:03 hou249bbodb3111 openais[5499]: [TOTEM] entering GATHER state from 2.
Mar 27 16:03:08 hou249bbodb3111 openais[5499]: [TOTEM] entering GATHER state from 0.
Mar 27 16:03:08 hou249bbodb3111 openais[5499]: [TOTEM] Creating commit token because I am the rep.
Mar 27 16:03:08 hou249bbodb3111 openais[5499]: [TOTEM] Saving state aru 21485d high seq received 21485d
Mar 27 16:03:08 hou249bbodb3111 openais[5499]: [TOTEM] Storing new sequence id for ring d8
Mar 27 16:03:08 hou249bbodb3111 openais[5499]: [TOTEM] entering COMMIT state.
Mar 27 16:03:08 hou249bbodb3111 openais[5499]: [TOTEM] entering RECOVERY state.

RAC01 CRS Log ( Heart Beat log )

2010-03-13 18:24:35.864
[crsd(10550)]CRS-1205:Auto-start failed for the CRS resource . Details in hou249bbodb3111.
2010-03-13 18:24:35.884
[crsd(10550)]CRS-1205:Auto-start failed for the CRS resource . Details in hou249bbodb3111.
[cssd(11219)]CRS-1601:CSSD Reconfiguration complete. Active nodes are hou249bbodb3111 hou249bbodb3112 .
2010-03-27 16:03:07.805
[cssd(11219)]CRS-1612:node hou249bbodb3112 (2) at 50% heartbeat fatal, eviction in 14.128 seconds
2010-03-27 16:03:08.801
[cssd(11219)]CRS-1612:node hou249bbodb3112 (2) at 50% heartbeat fatal, eviction in 13.138 seconds
2010-03-27 16:03:15.805
[cssd(11219)]CRS-1611:node hou249bbodb3112 (2) at 75% heartbeat fatal, eviction in 6.128 seconds
2010-03-27 16:03:19.801
[cssd(11219)]CRS-1610:node hou249bbodb3112 (2) at 90% heartbeat fatal, eviction in 2.138 seconds
2010-03-27 16:03:20.805
[cssd(11219)]CRS-1610:node hou249bbodb3112 (2) at 90% heartbeat fatal, eviction in 1.128 seconds
2010-03-27 16:03:21.801
[cssd(11219)]CRS-1610:node hou249bbodb3112 (2) at 90% heartbeat fatal, eviction in 0.138 seconds
2010-03-27 16:03:22.458
[cssd(11219)]CRS-1607:CSSD evicting node hou249bbodb3112. Details in /u01/app/oracle/product/crs/log/hou249bbodb3111/cssd/ocssd.log.
[cssd(11219)]CRS-1601:CSSD Reconfiguration complete. Active nodes are hou249bbodb3111 .
2010-03-27 16:03:42.736
[crsd(10550)]CRS-1204:Recovering CRS resources for node hou249bbodb3112.
[cssd(11219)]CRS-1601:CSSD Reconfiguration complete. Active nodes are hou249bbodb3111 hou249bbodb3112 .
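
As a cross-check of the timeline above: each CRS-1612/1611/1610 message carries both a timestamp and the time remaining until eviction, so adding the two should give roughly the same eviction deadline every time. The following is a minimal Python sketch of that arithmetic on three of the lines quoted above. The function and variable names are made up for this illustration and are not part of any Oracle tool. The 30-second misscount used at the end is only an assumption; the real value would have to be checked on the cluster (for example with `crsctl get css misscount`).

```python
import re
from datetime import datetime, timedelta

# Three timestamp/message pairs copied from the RAC01 CRS log above.
CRS_LOG = """\
2010-03-27 16:03:07.805
[cssd(11219)]CRS-1612:node hou249bbodb3112 (2) at 50% heartbeat fatal, eviction in 14.128 seconds
2010-03-27 16:03:15.805
[cssd(11219)]CRS-1611:node hou249bbodb3112 (2) at 75% heartbeat fatal, eviction in 6.128 seconds
2010-03-27 16:03:21.801
[cssd(11219)]CRS-1610:node hou249bbodb3112 (2) at 90% heartbeat fatal, eviction in 0.138 seconds
"""

TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}$")
EVICT_RE = re.compile(r"eviction in ([\d.]+) seconds")


def eviction_deadlines(log_text):
    """Return one projected eviction time per 'heartbeat fatal' message."""
    deadlines = []
    current_ts = None
    for line in log_text.splitlines():
        line = line.strip()
        if TS_RE.match(line):
            # A bare timestamp line applies to the message line(s) after it.
            current_ts = datetime.strptime(line, "%Y-%m-%d %H:%M:%S.%f")
            continue
        match = EVICT_RE.search(line)
        if match and current_ts is not None:
            remaining = timedelta(seconds=float(match.group(1)))
            deadlines.append(current_ts + remaining)
    return deadlines


deadlines = eviction_deadlines(CRS_LOG)
for d in deadlines:
    print("projected eviction at", d.strftime("%H:%M:%S.%f")[:-3])

# ASSUMPTION: misscount = 30 s. This value is not stated anywhere in the thread.
ASSUMED_MISSCOUNT = timedelta(seconds=30)
last_heartbeat = deadlines[0] - ASSUMED_MISSCOUNT
print("implied last heartbeat around", last_heartbeat.strftime("%H:%M:%S.%f")[:-3])
```

On these three lines the projected eviction times agree to within a few milliseconds (about 16:03:21.93), shortly before the eviction logged at 16:03:22.458. Under the assumed 30-second misscount, the last heartbeat received from node 2 would fall around 16:02:51.9, close to the LIP reset at 16:02:53 in the Linux log, which fits the reading that node 2 failed first.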

tolywang · posted on 2010-3-28 09:38

CPU load history on the node 2 server (very low):

Sat Mar 27 15:57:01 PDT 2010
0.35 0.45 0.40 4/619 8100
Sat Mar 27 15:58:01 PDT 2010
0.45 0.47 0.41 3/630 9804
Sat Mar 27 15:59:01 PDT 2010
1.64 0.70 0.48 3/609 11558
Sat Mar 27 16:00:01 PDT 2010
1.11 0.71 0.49 4/620 13268
Sat Mar 27 16:01:01 PDT 2010
0.79 0.67 0.49 4/668 15055
Sat Mar 27 16:02:01 PDT 2010
0.39 0.58 0.47 3/793 16894
Sat Mar 27 16:08:01 PDT 2010
2.00 0.81 0.29 5/415 9071
Sat Mar 27 16:09:01 PDT 2010
1.86 1.03 0.40 4/569 13485
Sat Mar 27 16:10:01 PDT 2010
1.02 0.95 0.42 7/592 15210
Sat Mar 27 16:09:22 PDT 2010
1.02 0.95 0.42 3/583 15230
Sat Mar 27 16:10:01 PDT 2010
0.83 0.91 0.42 4/591 15653
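
The samples above appear to be taken once a minute (an inference from the timestamps, not something stated in the post), so a missing run marks the window in which node 2 was down. A small Python sketch, with hypothetical helper names, that finds such gaps in the `date` lines:

```python
from datetime import datetime, timedelta

# Timestamp lines copied from the load history above (load-average lines omitted).
SAMPLES = """\
Sat Mar 27 16:00:01 PDT 2010
Sat Mar 27 16:01:01 PDT 2010
Sat Mar 27 16:02:01 PDT 2010
Sat Mar 27 16:08:01 PDT 2010
Sat Mar 27 16:09:01 PDT 2010
"""


def parse_date_line(line):
    """Parse `date` output, ignoring the time zone token (e.g. PDT)."""
    parts = line.split()
    # parts = [weekday, month, day, HH:MM:SS, tz, year]
    text = " ".join(parts[1:4] + parts[5:6])
    return datetime.strptime(text, "%b %d %H:%M:%S %Y")


def find_gaps(text, expected=timedelta(minutes=1), slack=timedelta(seconds=30)):
    """Return (previous, next) pairs where samples are further apart than expected."""
    stamps = [parse_date_line(l) for l in text.splitlines() if l.strip()]
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        if cur - prev > expected + slack:
            gaps.append((prev, cur))
    return gaps


gaps = find_gaps(SAMPLES)
for prev, cur in gaps:
    print("no samples between", prev.time(), "and", cur.time())
```

For the data above this reports a single gap between 16:02:01 and 16:08:01, which brackets the 16:02:53 LIP reset seen on node 1. The load averages in the last samples before the gap are all below 1.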

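tolywang · posted on 2010-3-29 10:51

It rebooted twice more today, again node 2. Following the document Yong Huang provided, I checked whether the GFS file system could be the cause, and a few items are indeed not satisfied. But
I do not know whether that is related. I have also asked HP to look into possible hardware problems in parallel.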

Isolation of Red Hat Global File System (GFS) Issues:
If an issue is suspected by Oracle Support to be GFS software related, the issue would be transferred to
Red Hat Support after advising the customer to collect the following information required by Red Hat Support.
The collection of this information is the customer's responsibility.
Please verify all of the items below to determine that a case is due to GFS software:

The output of hostname and uname -n should be identical.
All systems should be able to ping each other by hostname.
Verify that the kernel is not tainted by executing lsmod.
The command rpm -qa | grep GFS should state that GFSUserToolsRPM and GFSKernelModsRPM are installed.
The command rpm -q perl-Net-Telnet should state that the perl-Net-Telnet package is installed.
Verify that the system times on all nodes/servers are within 5 minutes of each other.
If network storage is being used, all systems should be able to see attached LUNS.
The output of iptables -L should not show any traffic being prevented between any systems in the GFS environment.
Customers should be advised that Red Hat Support requires a sysreport from all systems experiencing problems. Sysreport can be installed by running up2date sysreport and then executed by entering sysreport from a shell prompt.

---------------------------------------------------
Check Items (Some items don't meet the requirements) :
---------------------------------------------------

[root@hou249bbodb3112 ~]# su - oracle
hou249bbodb3112<*wmb2bprd2*/home/oracle>$
hou249bbodb3112<*wmb2bprd2*/home/oracle>$sqlplus / as sysdba
SQL> show parameter filesystemio_options
NAME                               TYPE       VALUE
------------------------------------ ----------- ------------------------------
filesystemio_options                string    directIO
SQL>

1.  The output of hostname and uname -n should be identical.
hou249bbodb3112<*wmb2bprd2*/home/oracle>$hostname
hou249bbodb3112
hou249bbodb3112<*wmb2bprd2*/home/oracle>$uname -a
Linux hou249bbodb3112 2.6.18-128.1.16.el5xen #1 SMP Fri Jun 26 11:10:46 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
hou249bbodb3112<*wmb2bprd2*/home/oracle>$
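
Note that the check asks for `uname -n`, while the transcript above ran `uname -a`; the node name is the second field of the `-a` output. A minimal Python sketch of the comparison, using the two outputs quoted above (the helper names are hypothetical):

```python
import os
import socket

# Outputs copied from the transcript above.
HOSTNAME_OUTPUT = "hou249bbodb3112"
UNAME_A_OUTPUT = (
    "Linux hou249bbodb3112 2.6.18-128.1.16.el5xen #1 SMP "
    "Fri Jun 26 11:10:46 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux"
)


def nodename_from_uname_a(line):
    """`uname -a` prints the kernel name first and the node name second."""
    return line.split()[1]


def local_names_match():
    """Same comparison on the machine this script runs on (Unix only)."""
    return socket.gethostname() == os.uname().nodename


print("transcript check passes:",
      nodename_from_uname_a(UNAME_A_OUTPUT) == HOSTNAME_OUTPUT)
print("local check passes:", local_names_match())
```

For the transcript above the two names are identical, so check 1 looks satisfied on node 2.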

2. All systems should be able to ping each other by hostname.

[root@hou249bbodb3111 log]# ping hou249bbodb3112
PING hou249bbodb3112 (10.18.223.112) 56(84) bytes of data.
64 bytes from hou249bbodb3112 (10.18.223.112): icmp_seq=1 ttl=64 time=0.195 ms
64 bytes from hou249bbodb3112 (10.18.223.112): icmp_seq=2 ttl=64 time=0.162 ms
64 bytes from hou249bbodb3112 (10.18.223.112): icmp_seq=3 ttl=64 time=0.175 ms
64 bytes from hou249bbodb3112 (10.18.223.112): icmp_seq=4 ttl=64 time=0.171 ms

hou249bbodb3112<*wmb2bprd2*/home/oracle>$ping hou249bbodb3111
PING hou249bbodb3111 (10.18.223.111) 56(84) bytes of data.
64 bytes from hou249bbodb3111 (10.18.223.111): icmp_seq=1 ttl=64 time=0.187 ms
64 bytes from hou249bbodb3111 (10.18.223.111): icmp_seq=2 ttl=64 time=0.152 ms
64 bytes from hou249bbodb3111 (10.18.223.111): icmp_seq=3 ttl=64 time=0.162 ms
64 bytes from hou249bbodb3111 (10.18.223.111): icmp_seq=4 ttl=64 time=0.171 ms
64 bytes from hou249bbodb3111 (10.18.223.111): icmp_seq=5 ttl=64 time=0.159 ms

3. Verify that the kernel is not tainted by executing lsmod.

[root@hou249bbodb3112 ~]# lsmod
Module                Size  Used by
blktap             151653  2 [permanent]
blkbk                54777  0 [permanent]
ipt_MASQUERADE       36800  1
iptable_nat          40773  1
ip_nat                52973  2 ipt_MASQUERADE,iptable_nat
xt_state             35265  1
ip_conntrack          91237  4 ipt_MASQUERADE,iptable_nat,ip_nat,xt_state
nfnetlink             40457  2 ip_nat,ip_conntrack
ipt_REJECT          38849  2
xt_tcpudp             36289  4
iptable_filter       36161  1
ip_tables             55201  2 iptable_nat,iptable_filter
x_tables             50377  6 ipt_MASQUERADE,iptable_nat,xt_state,ipt_REJECT,xt_tcpudp,ip_tables
mptctl                63817  1
mptbase             113381  1 mptctl
ipmi_si             75680  3
ipmi_devintf          44432  6
ipmi_msghandler       72052  2 ipmi_si,ipmi_devintf
autofs4             57033  2
hidp                83521  2
gfs                324124  6
rfcomm             104809  0
l2cap                89281  10 hidp,rfcomm
bluetooth          118597  5 hidp,rfcomm,l2cap
lock_dlm             51425  0
gfs2                523820  1 lock_dlm
dlm                159425  30 gfs,lock_dlm
configfs             62301  2 dlm
bridge                91761  0
netloop             40129  0
netbk                130305  0 [permanent]
sunrpc             197513  1
bonding             120737  0
dm_multipath          55385  0
scsi_dh             41665  1 dm_multipath
video                53069  0
hwmon                36553  0
backlight             39873  1 video
sbs                   49921  0
i2c_ec                38593  1 sbs
i2c_core             56129  1 i2c_ec
button                40545  0
battery             43849  0
asus_acpi             50917  0
ac                   38729  0
ipv6                424737  116
xfrm_nalgo          43333  1 ipv6
crypto_api          42945  1 xfrm_nalgo
parport_pc          62313  0
lp                   47121  0
parport             73293  2 parport_pc,lp
joydev                43969  0
pcspkr                36289  0
sg                   69865  0
serial_core          56257  0
bnx2                207496  0
shpchp                70509  0
hpilo                43217  0
serio_raw             40517  0
ide_cd                73441  0
cdrom                68713  1 ide_cd
dm_raid45             98897  0
dm_message          36289  1 dm_raid45
dm_region_hash       46273  1 dm_raid45
dm_mem_cache          39489  1 dm_raid45
dm_snapshot          51593  0
dm_zero             35265  0
dm_mirror             54217  0
dm_log                44865  3 dm_raid45,dm_region_hash,dm_mirror
dm_mod             100497  18 dm_multipath,dm_raid45,dm_snapshot,dm_zero,dm_mirror,dm_log
ata_piix             56901  0
libata             208721  1 ata_piix
cciss                98633  3
ext3                167633  2
jbd                   94001  1 ext3
uhci_hcd             57561  0
ohci_hcd             56053  0
ehci_hcd             65741  0
qla2xxx             1015212  31
sd_mod                56385  8
scsi_mod             196697  7 mptctl,scsi_dh,sg,libata,cciss,qla2xxx,sd_mod
qla2xxx_conf       334856  1
intermodule          37508  2 qla2xxx,qla2xxx_conf
[root@hou249bbodb3112 ~]#
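
The `lsmod` listing above does not print an explicit taint flag, and the post records no result for this check. The kernel also exposes a numeric taint value in `/proc/sys/kernel/tainted`, where 0 means not tainted. A hedged Python sketch for reading it (the helper names are made up; the meaning of each bit depends on the kernel version, so the bits are only listed, not interpreted):

```python
def taint_bits(value):
    """Return the positions of the bits set in a kernel taint value."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the kernel taint value; returns None if the file is unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


value = read_taint()
if value is None:
    print("taint value not available on this system")
elif value == 0:
    print("kernel is not tainted")
else:
    print("kernel is tainted, value", value, "bits set:", taint_bits(value))
```

Running this on both nodes would give a definite answer for item 3 instead of relying on a visual scan of the module list.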

4. The command rpm -qa | grep GFS should state that GFSUserToolsRPM and GFSKernelModsRPM are installed.
Result :  Fail .
GFSUserToolsRPM and GFSKernelModsRPM are NOT installed .

[root@hou249bbodb3112 ~]# rpm -qa | grep GFS
[root@hou249bbodb3112 ~]#
[root@hou249bbodb3112 ~]#
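
One caveat on this "Fail": `grep GFS` is case-sensitive, and the `lsmod` output in check 3 shows the `gfs` and `gfs2` modules loaded, so the packages may simply be named in lowercase on this release. The shell equivalent of a case-insensitive search would be `rpm -qa | grep -i gfs`. The Python sketch below shows the difference; the package names in the sample are hypothetical placeholders, not output from these servers.

```python
# HYPOTHETICAL sample of `rpm -qa` output; only perl-Net-Telnet-3.03-5
# appears in the thread, the gfs names are assumptions for illustration.
SAMPLE_RPM_QA = """\
perl-Net-Telnet-3.03-5
gfs-utils-<version>
kmod-gfs-<version>
"""


def find_packages(rpm_qa_output, needle, ignore_case=True):
    """Return package lines containing `needle`."""
    packages = rpm_qa_output.split()
    if ignore_case:
        needle = needle.lower()
        return [p for p in packages if needle in p.lower()]
    return [p for p in packages if needle in p]


print("case-sensitive:  ", find_packages(SAMPLE_RPM_QA, "GFS", ignore_case=False))
print("case-insensitive:", find_packages(SAMPLE_RPM_QA, "GFS"))
```

If the case-insensitive search on the real nodes also returns nothing, the result of check 4 stands as posted.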

5. The command rpm -q perl-Net-Telnet should state that the perl-Net-Telnet package is installed.
Result : OK
[root@hou249bbodb3112 ~]# rpm -q perl-Net-Telnet
perl-Net-Telnet-3.03-5
[root@hou249bbodb3112 ~]#

6. Verify that the system times on all nodes/servers are within 5 minutes of each other.
Result : OK

7. If network storage is being used, all systems should be able to see attached LUNS.
Result :  OK

8. The output of iptables -L should not show any traffic being prevented between any systems in the GFS environment.
Result : Fail
[root@hou249bbodb3112 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target    prot opt source             destination
ACCEPT    udp  --  anywhere          anywhere          udp dpt:domain
ACCEPT    tcp  --  anywhere          anywhere          tcp dpt:domain
ACCEPT    udp  --  anywhere          anywhere          udp dpt:bootps
ACCEPT    tcp  --  anywhere          anywhere          tcp dpt:bootps
Chain FORWARD (policy ACCEPT)
target    prot opt source             destination
ACCEPT    all  --  anywhere          192.168.122.0/24 state RELATED,ESTABLISHED
ACCEPT    all  --  192.168.122.0/24    anywhere
ACCEPT    all  --  anywhere          anywhere
REJECT    all  --  anywhere          anywhere          reject-with icmp-port-unreachable
REJECT    all  --  anywhere          anywhere          reject-with icmp-port-unreachable
Chain OUTPUT (policy ACCEPT)
target    prot opt source             destination
[root@hou249bbodb3112 ~]#

[root@hou249bbodb3111 log]# iptables -L
Chain INPUT (policy ACCEPT)
target    prot opt source             destination
ACCEPT    udp  --  anywhere          anywhere          udp dpt:domain
ACCEPT    tcp  --  anywhere          anywhere          tcp dpt:domain
ACCEPT    udp  --  anywhere          anywhere          udp dpt:bootps
ACCEPT    tcp  --  anywhere          anywhere          tcp dpt:bootps
Chain FORWARD (policy ACCEPT)
target    prot opt source             destination
ACCEPT    all  --  anywhere          192.168.122.0/24 state RELATED,ESTABLISHED
ACCEPT    all  --  192.168.122.0/24    anywhere
ACCEPT    all  --  anywhere          anywhere
REJECT    all  --  anywhere          anywhere          reject-with icmp-port-unreachable
REJECT    all  --  anywhere          anywhere          reject-with icmp-port-unreachable
Chain OUTPUT (policy ACCEPT)
target    prot opt source             destination
[root@hou249bbodb3111 log]#
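
To make the "Fail" on check 8 more precise, the listings can be scanned for rules that block traffic and for the chain they belong to. A minimal Python sketch with a hypothetical helper name, run on a shortened copy of the output above:

```python
# Shortened copy of the `iptables -L` output above.
IPTABLES_L = """\
Chain INPUT (policy ACCEPT)
target    prot opt source             destination
ACCEPT    udp  --  anywhere          anywhere          udp dpt:domain
Chain FORWARD (policy ACCEPT)
target    prot opt source             destination
ACCEPT    all  --  anywhere          192.168.122.0/24 state RELATED,ESTABLISHED
ACCEPT    all  --  192.168.122.0/24    anywhere
ACCEPT    all  --  anywhere          anywhere
REJECT    all  --  anywhere          anywhere          reject-with icmp-port-unreachable
Chain OUTPUT (policy ACCEPT)
target    prot opt source             destination
"""


def blocking_rules(text):
    """Return (chain, description) for every REJECT/DROP rule or default policy."""
    chain = None
    found = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Chain":
            chain = fields[1]
            if "(policy DROP)" in line or "(policy REJECT)" in line:
                found.append((chain, "default policy blocks traffic"))
            continue
        if fields[0] in ("REJECT", "DROP"):
            found.append((chain, line.strip()))
    return found


for chain, rule in blocking_rules(IPTABLES_L):
    print(chain, "->", rule)
```

In the output above, the only blocking rules are in the FORWARD chain, which applies to packets routed through the host rather than to packets addressed to the node itself; INPUT and OUTPUT both have policy ACCEPT and no REJECT or DROP rules. `reject-with icmp-port-unreachable` names the ICMP error sent back to the sender. Plain `iptables -L` also hides interface matches, so `iptables -L -v -n` would be needed to see exactly what each rule applies to.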

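yywxy · posted on 2010-3-29 16:47

Two switches probably cannot do load balancing, can they?
I seem to recall IBM engineers advising against doing this.
Could that be the cause?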

Yong Huang · posted on 2010-3-30 01:40

My buddy looked at this case. He thinks the voting disk or OCR disk may have an intermittent read problem. After all, the errors you got point to qla2xxx, which is the HBA driver. Maybe you can set up a cronjob to run once per minute: dd if=[OCR and voting disk]. Or run it more frequently.
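
A periodic read probe along those lines could be sketched in Python as below. This is only an illustration under assumptions: the function name is made up, the device paths are left for the caller to supply (the post does not give them), and a plain read like this may be served from cache rather than from the device, so it is not a substitute for proper storage diagnostics.

```python
import sys
import time


def timed_read(path, nbytes=1024 * 1024, block=64 * 1024):
    """Read up to nbytes from path; return (bytes_read, seconds, error_or_None)."""
    start = time.monotonic()
    total = 0
    try:
        with open(path, "rb", buffering=0) as f:
            while total < nbytes:
                chunk = f.read(min(block, nbytes - total))
                if not chunk:  # end of file or device
                    break
                total += len(chunk)
    except OSError as exc:
        return total, time.monotonic() - start, str(exc)
    return total, time.monotonic() - start, None


if __name__ == "__main__":
    # Paths are passed on the command line, e.g. the OCR and voting disk devices.
    for device in sys.argv[1:]:
        n, secs, err = timed_read(device)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {device}: read {n} bytes in {secs:.3f}s error={err}")
```

Run from cron once a minute with output appended to a log file, slow or failed reads shortly before a reboot would show up as long durations or a non-empty error.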

There's one more set of files you can check, the client files under CRS_LOG, e.g.:

$ cd /u01/crs/oracle/product/10.2.0/crs/log/`hostname -s`/client
$ ll
total 660
-rw-r--r-- 1 oracle oinstall 185 Mar 22 10:00 css1.log
-rw-r--r-- 1 oracle oinstall 186 Mar 29 10:00 css.log
-rw-r--r-- 1 oracle oinstall 662107 Mar 29 11:15 oifcfg.log
...

Sometimes some files here give a clue.

If the server had extremely high load, you would also see messages in /var/log/maillog related to it, causing sendmail to fail to process mails.

Your IP table blocks forwarding ICMP traffic (if I read it correctly). I think that's fine. Interconnect goes on UDP anyway, and you can ping each other. But it's a generally bad idea to place any restriction on any inter-node network traffic.

I'm not familiar with GFS or the 4th check you did (about GFSUserToolsRPM and GFSKernelModsRPM). It could be a problem. I suggest you open an SR with Oracle and/or ticket with Red Hat. At least ask the system admin about these two rpm's.

Bonding mode 2 is probably not the cause of the problem. But at least before you find the root cause, why not be more conservative and change it to mode 1? We always use mode 1.
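
To confirm which mode is actually in effect, the bonding driver reports it in `/proc/net/bonding/<interface>`. The Python sketch below extracts that line. The interface name `bond0` and the sample text are assumptions for illustration, not output from these servers; mode 2 is balance-xor and mode 1 is active-backup.

```python
# ILLUSTRATIVE sample of the relevant lines; not taken from the thread.
SAMPLE_PROC_BONDING = """\
Bonding Mode: load balancing (xor)
MII Status: up
"""


def bonding_mode(proc_text):
    """Return the text after 'Bonding Mode:' or None if the line is missing."""
    for line in proc_text.splitlines():
        if line.startswith("Bonding Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def read_bonding_mode(interface="bond0"):
    """Read the mode of a bonding interface; None if it does not exist."""
    try:
        with open(f"/proc/net/bonding/{interface}") as f:
            return bonding_mode(f.read())
    except OSError:
        return None


print("sample:", bonding_mode(SAMPLE_PROC_BONDING))
print("bond0 on this host:", read_bonding_mode())
```

After a change to mode 1, the same line would be expected to read `fault-tolerance (active-backup)`.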

Yong Huang

oradbHome · posted on 2010-3-30 15:16

Find out why the load was so high before the reboot. It may be caused by I/O on the local disks, like the page out/in that kamus mentioned.

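tolywang · posted on 2010-3-30 23:33

Originally posted by oradbHome on 2010-3-30 15:16
Find out why the load was so high before the reboot. It may be caused by I/O on the local disks, like the page out/in that kamus mentioned.

But the iowait in the screenshot is 0.0%, so there was almost no I/O. Could it be that by then the shared disks on node 2 were already close to being unmounted,
so there was no I/O, while the sys CPU load was very high?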
