OP: tolywang

[HA] Node 2 just rebooted itself automatically, for the 5th time now

11#
Posted on 2010-3-11 10:22
A Linux cluster should have its own reboot (fencing) mechanism too, right? But that should also leave a record in the logs.

12#
Posted on 2010-3-11 10:24
Could this be a storage problem causing voting disk access latency?

13#
OP | Posted on 2010-3-11 11:01
Quoting andyxu, posted 2010-3-11 10:22:
A Linux cluster should have its own reboot (fencing) mechanism too, right? But that should also leave a record in the logs.



Yes. If the Linux cluster (GFS) had caused it, there would be fence-type entries in the Linux logs, and there are none here either.
I have heard of high load hanging a machine, but I have never heard of it causing an unexpected reboot.

14#
OP | Posted on 2010-3-11 11:04
Quoting jardon_zhao, posted 2010-3-11 10:24:
Could this be a storage problem causing voting disk access latency?


If it were a storage problem, wouldn't it be there all the time? Or is the voting disk access latency only triggered once things reach a certain threshold?

15#
Posted on 2010-3-11 14:14
[cssd(10708)]CRS-1612:node hou249bbodb3112 (2) at 50% heartbeat fatal, eviction in 15.000 seconds
2010-03-09 22:11:25.348
[cssd(10708)]CRS-1612:node hou249bbodb3112 (2) at 50% heartbeat fatal, eviction in 14.010 seconds
2010-03-09 22:11:32.348
[cssd(10708)]CRS-1611:node hou249bbodb3112 (2) at 75% heartbeat fatal, eviction in 7.010 seconds
2010-03-09 22:11:37.352
[cssd(10708)]CRS-1610:node hou249bbodb3112 (2) at 90% heartbeat fatal, eviction in 2.000 seconds
2010-03-09 22:11:38.348
[cssd(10708)]CRS-1610:node hou249bbodb3112 (2) at 90% heartbeat fatal, eviction in 1.010 seconds
2010-03-09 22:11:39.353
[cssd(10708)]CRS-1607:CSSD evicting node hou249bbodb3112. Details in /u01/app/oracle/product/crs/log/hou249bbodb3111/cssd/ocssd.log.

A heartbeat problem.

I have run into RAC node reboots before too; as I remember, no shutdown log was written there either.
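For anyone hitting the same symptom, the eviction countdown above can be pulled straight out of the CRS alert log. A minimal sketch (the path is illustrative, based on the ocssd.log path quoted in the message; adjust it for your CRS home):

```shell
# Scan the CRS alert logs for the heartbeat/eviction messages shown above
# (CRS-1607 and CRS-1610 through CRS-1612).
grep -E 'CRS-16(07|1[0-2])' /u01/app/oracle/product/crs/log/*/alert*.log \
  || echo "no matching CRS alert log at that path"
```

The CRS-1610/1611/1612 lines give the 50%/75%/90% warnings, and CRS-1607 marks the actual eviction, so the timestamps bracket the exact window to compare against the OS and SAN logs.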

16#
OP | Posted on 2010-3-12 09:14
We are still using two D-Link switches for redundancy, with the heartbeat carried over bonded NICs, but the bond is set to mode=2. I do not know whether that is related. Normally active/standby is enough; here it appears to be set to load balancing (this is a handed-over system with no documentation).



mode=0 (balance-rr)
Round-robin policy: Transmit packets in sequential order from the first available slave through the last. This mode provides load balancing and fault tolerance.

mode=1 (active-backup)
Active-backup policy: Only one slave in the bond is active. A different slave becomes active if, and only if, the active slave fails. The bond's MAC address is externally visible on only one port (network adapter) to avoid confusing the switch. This mode provides fault tolerance. The primary option affects the behavior of this mode.

mode=2 (balance-xor)
XOR policy: Transmit based on [(source MAC address XOR'd with destination MAC address) modulo slave count]. This selects the same slave for each destination MAC address. This mode provides load balancing and fault tolerance.

mode=3 (broadcast)
Broadcast policy: transmits everything on all slave interfaces. This mode provides fault tolerance.
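For comparison, a minimal active-backup bond for the private interconnect on RHEL4 might look like the sketch below. This is an assumption-laden example, not the thread's actual config: the device names (bond0, eth1, eth2), the 172.16 address, and the miimon value are all illustrative.

```
# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=1 miimon=100   # mode=1 = active-backup; poll link state every 100 ms

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=172.16.0.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth1  (repeat for eth2)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

With mode=1 only one slave carries traffic and the other takes over on link failure, so nothing depends on MAC-hash behavior across the two switches, which is why active-backup is the usual choice for a RAC interconnect.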

17#
OP | Posted on 2010-3-12 09:20
Topology diagram. The switches are D-Link; the public NICs are also redundant (visible in the diagram, the 10.18 subnet), and the 172.16 subnet is the private NIC.

[Last edited by tolywang on 2010-3-12 09:23]

[Attachment: bond_network.jpg (211.35 KB), network topology diagram]

18#
Posted on 2010-3-12 16:16
LIP reset occured (f7f7)
This feels like the cause, but I do not know what LIP means or what it does.

19#
OP | Posted on 2010-3-13 01:24
Quoting terry_wu040802, posted 2010-3-12 16:16:
LIP reset occured (f7f7)
This feels like the cause, but I do not know what LIP means or what it does.


These messages are from node 1, the surviving node (probably not errors). It seems this kind of reset notice appears whenever another node unmounts the shared disks.

Mar  9 22:11:11 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.0: LIP reset occured (f7f7).
Mar  9 22:11:11 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.0: LIP occured (f7f7).
Mar  9 22:11:11 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.1: LIP reset occured (f7f7).
Mar  9 22:11:11 hou249bbodb3111 kernel: qla2xxx 0000:0d:00.1: LIP occured (f7f7).
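On the LIP question above: LIP stands for Loop Initialization Primitive, the initialization sequence a Fibre Channel arbitrated loop runs whenever a device joins or leaves the loop. A node dropping off the SAN during a reboot would therefore trigger LIPs on the surviving node's HBA ports, which matches the timestamps here. One way to correlate them with the eviction window is a sketch like this (log path assumes a standard syslog setup):

```shell
# Pull qla2xxx LIP events from the kernel log around the eviction window
# (22:11 on Mar 9, per the CRS log earlier in the thread).
grep qla2xxx /var/log/messages 2>/dev/null | grep LIP | grep 'Mar  9 22:1' \
  || echo "no qla2xxx LIP entries found in that window"
```

If the LIPs consistently land at or just after the eviction time, they are more likely an effect of the node leaving the fabric than the cause of the reboot.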

20#
OP | Posted on 2010-3-13 01:27
Two documents worth consulting:


Provided by itpub110: doc 265769.1


Provided by moderator Yong Huang:


According to the first `top' output, I suspect it's GFS. Not many people use that for RAC. (We use ASM, like most people in the world.) You may open a ticket with Red Hat on that. Also search for bugs on GFS.

At the end of this document
https://support.oracle.com/CSP/m ... 9530.1&type=NOT
there's a simple checklist for GFS. Can you go through it? Work with Red Hat and your UNIX and storage engineers. Is there a way to enable debugging or more verbose output for GFS?

> If it were a storage problem, wouldn't it be there all the time?

That's not true.

> Or is the voting disk access latency only triggered once things reach a certain threshold?

I think so. I'm guessing the high sys CPU usage was attributed to GFS. But I don't know what triggered it and how.



https://support.oracle.com/CSP/main/article?cmd=show&id=329530.1&type=NOT





Using Redhat Global File System (GFS) as shared storage for RAC [ID 329530.1]
Modified 01-AUG-2008     Type BULLETIN     Status PUBLISHED

In this Document
  Purpose
  Scope and Application
  Using Redhat Global File System (GFS) as shared storage for RAC
     Oracle and GFS Support
     I. Support Processes for GFS 6.0 and 6.1
     II. Supported Configuration Details for Oracle 9iRAC, 10gR1 RAC and 10gR2 RAC
     III. Components of a RAC cluster that can use GFS
     IV. Isolation of Red Hat Global File System (GFS) Issues:
     V. More information about Red Hat GFS and lock server configurations can be found at:

Applies to: Oracle Server - Enterprise Edition - Version: 9.2.0.1 to 10.2.0.1
Linux x86-64
Linux x86
Real Application Clusters with shared storage on Global File System
Purpose
The purpose of this bulletin is to explain the support process when using Redhat Global File System (GFS) as the shared file system for Real Application Clusters.

Scope and Application
This note is intended for customers using Real Application Clusters with shared storage on GFS. Using GFS on a single instance is supported.

Using Redhat Global File System (GFS) as shared storage for RAC

Oracle and GFS Support

I. Support Processes for GFS 6.0 and 6.1
  • Oracle versions 9.2 RAC,10gR1 RAC and 10gR2 RAC have been successfully tested by Oracle with Red Hat GFS on RHEL4 and RHEL5
  • Oracle Enterprise Linux: as part of the Oracle Unbreakable Linux Program, Oracle offers for free download or on CD:
    • Oracle Enterprise Linux 4 fully compatible with RedHat Enterprise Linux 4 AS/ES
      Red Hat has a separate product called Red Hat Cluster Suite, which includes GFS for RHEL4. Oracle does not distribute or support the Red Hat Cluster Suite packages for RHEL4, and as such GFS is not part of OEL4. Unbreakable Linux support does not cover additional packages and products from Red Hat other than OEL.
    • Oracle Enterprise Linux 5 fully compatible with Red Hat Linux 5 Server and Advanced Platform.
      Red Hat Linux 5 Server and Advanced Platform include GFS, as does OEL5, and the Unbreakable Linux Support Program supports Unbreakable Linux customers running GFS.
  • Oracle will provide support for all Oracle products certified on Red Hat Enterprise Linux to all customers
  • Oracle will provide GFS support on RHEL5 and Oracle Enterprise Linux 5 for Unbreakable Linux customers only. For all other situations (Unbreakable Linux customers running OEL4 or RHEL4, and non-Unbreakable Linux customers running RHEL4 or RHEL5), the following applies:
    • Oracle's product support teams will not take support calls on Red Hat GFS. All issues known to be related to Red Hat GFS must be opened with Red Hat directly. When an Oracle SR is opened for an Oracle product or a Red Hat Enterprise Linux issue in a configuration that includes GFS, Oracle Support will do their best effort to determine if the issue is GFS software related. In that case, Oracle will hand-off the GFS related issue to Red Hat Support. Oracle will continue to provide support for all Oracle products and Red Hat Enterprise Linux as described in http://www.oracle.com/support/collateral/elsp-coverage.pdf
    • Joint support escalations: When Oracle RAC, Red Hat GFS, and Red Hat Enterprise Linux are part of a joint customer solution and an SR requires escalation, the processes established in the current Red Hat/Oracle joint support agreement should be followed. (i.e. Oracle will support Oracle products. If the customer has a Linux support contract with Oracle or is during the free trial period, Oracle will also support Enterprise Linux or RedHat Enterprise Linux. RedHat will support RedHat GFS.)
II. Supported Configuration Details for Oracle 9iRAC, 10gR1 RAC and 10gR2 RAC
Please note that GFS on 10gR2 RAC is now certified to a specific patch set level. GFS on 10gR1 or 9iR2 is certified to the maintenance-release level (i.e. any version of 9.2.0.x). However, it is highly recommended that the customer upgrade to the most recent patch release of Oracle RAC.

Some 3rd party applications are also certified to a specific patch set level of RAC. Please consult the certification matrix for the 3rd party application to ensure certification compatibility with any of the existing configurations below.


*[A] Oracle 10.2.0.3/10.2.0.4 RAC
RHEL 5 Update 1 and above
GFS 6.1
CLVM for GFS volume management
Device Mapper for storage multipathing
x86 and x86_64 architectures
DLM lock manager with Qdisk (see below)

*[B] Oracle 10.2.0.3/10.2.0.4 RAC
RHEL 4 Update 5 and above
GFS 6.1
CLVM for GFS volume management
Device Mapper for storage multipathing
x86 and x86_64 architectures
DLM lock manager with Qdisk (see below)

*[C] Oracle 10.2.0.3 RAC
RHEL 4 Update 3 and above
GFS 6.1
CLVM for GFS volume management
Device Mapper for storage multipathing
x86 and x86_64 architectures
GULM lock manager (see below)

*[D] Oracle 10.2.0.3 RAC
RHEL 3 Update 6 and above
GFS 6.0
Pool for GFS volume management
Qlogic for storage multipathing
x86 and x86_64 architectures
GULM lock manager (see below)

*[E] Oracle 10gR1 RAC
RHEL 4 Update 3 and above
GFS 6.1
CLVM for GFS volume management
Device Mapper for storage multipathing
x86 and x86_64 architectures
GULM lock manager (see below)

Specific versions of Red Hat GFS are certified for use with RAC using either the GULM (Grand Unified Lock Manager) or the DLM (Distributed Lock Manager).

For mission-critical Oracle RAC environments, Oracle supports the GULM external lock manager configurations with dedicated lock server nodes, or the DLM embedded lock manager in conjunction with the Qdisk quorum disk facility. Either configuration allows you to reboot, remove, or add Oracle RAC server nodes without affecting lock manager availability and hence the operation of other nodes in the Oracle RAC cluster.

With the GULM, GFS uses a centralized lock server daemon that can be configured as a single master and multiple slave server daemons on physically separate nodes (two or four slave nodes). If the master lock server fails, one of the slave lock servers that has been keeping an updated copy of the lock state becomes the master lock server. In this manner single node lock server failures can be tolerated. Additionally, multiple RAC instances can share GULM external lock server nodes within a larger single GFS cluster. The ability to consolidate RAC instances while using more robust lock management server nodes may prove architecturally desirable for some GFS RAC implementations.

With the DLM, GFS uses a distributed lock manager co-located on the RAC nodes in the cluster. To ensure that RAC can operate with only 1 surviving node, the Qdisk facility must be configured when the DLM is selected. This facility provides a shared quorum disk on a raw disk partition to hold the quorum for GFS together with the remaining RAC node. The DLM/Qdisk facility comes standard with RHEL4 update 5 and RHEL5.

Oracle RAC on GFS was certified on a fully redundant configuration with both Ethernet channel bonding on all RAC GCS (Global Cache Services) pathways and DM-multipath on all storage pathways. It is highly recommended that customers run RAC in a fully redundant configuration, but they can choose to run at a lower level of redundancy, if this meets their business requirements


III. Components of a RAC cluster that can use GFS
Both shared home and database files can reside on GFS using the context-dependent path names feature.

The Oracle Clusterware file OCR (Oracle Cluster Repository) and quorum VOTE file must be placed on raw devices. This is due to a limitation in the Clusterware installer to recognize GFS.

These files may also be configured using DM-multipath. In certain versions of Clusterware, the initialization utility fails to recognize the raw devices as valid devices when accessed via DM-multipath. DM-multipath can be temporarily disabled on the node where the Clusterware installer script root.sh executes for the very first time. Once the 2 files are initialized, all subsequent DM-multipath access to these files proceeds without error.

Both direct IO (DIO) and asynchronous IO (AIO) for GFS are fully supported in RHEL4 (Update 5 is recommended) and RHEL5. These options significantly increase performance when used with Oracle 10g RAC. It is highly recommended that RAC implementations enable DIO at a minimum. Oracle with DIO avoids the duplication of the Oracle SGA buffer cache within the RHEL page cache. AIO is especially beneficial as IO rates increase. The Oracle init.ora parameter filesystemio_options is used to specify AIO, DIO or both.

FILESYSTEMIO_OPTIONS = { none | setall | directIO | asynch }

The asynch option (AIO without DIO) is not a supported combination for RAC on GFS.
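Per the paragraph above, setall (DIO plus AIO) would be the usual choice on GFS, since asynch alone is the unsupported combination there. A sketch of the init.ora entry (parameter name and values are from the note itself):

```
# init.ora / spfile: enable both direct and asynchronous IO on GFS
# (plain 'asynch' without DIO is the unsupported combination per the note)
filesystemio_options = setall
```

The parameter is static, so the instance has to be restarted for the change to take effect.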


IV. Isolation of Red Hat Global File System (GFS) Issues
If an issue is suspected by Oracle Support to be GFS software related, the issue will be transferred to Red Hat Support after advising the customer to collect the following information required by Red Hat Support. Collecting this information is the customer's responsibility.

Please verify all of the items below to determine whether a case is due to GFS software:


  • The output of hostname and uname -n should be identical.
  • All systems should be able to ping each other by hostname.
  • Verify that the kernel is not tainted by executing lsmod.
  • The command rpm -qa | grep GFS should state that GFSUserToolsRPM and GFSKernelModsRPM are installed.
  • The command rpm -q perl-Net-Telnet should state that the perl-Net-Telnet package is installed.
  • Verify that the system times on all nodes/servers are within 5 minutes of each other.
  • If network storage is being used, all systems should be able to see attached LUNS.
  • The output of iptables -L should not show any traffic being prevented between any systems in the GFS environment.
  • Customers should be advised that Red Hat Support requires a sysreport from all systems experiencing problems. Sysreport can be installed by running up2date sysreport and then executed by entering sysreport at a shell prompt.
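The first few items of that checklist are easy to script. A rough sketch (assumes standard commands on an RHEL node; the package and firewall checks only mean something when run as root on an actual GFS box):

```shell
#!/bin/sh
# Sketch of some of the pre-escalation checks listed above.

# hostname and `uname -n` should be identical
h=$(hostname); u=$(uname -n)
if [ "$h" = "$u" ]; then echo "hostname check: OK ($h)"; else echo "hostname check: MISMATCH ($h vs $u)"; fi

# kernel taint flag (0 = not tainted)
[ -r /proc/sys/kernel/tainted ] && echo "kernel tainted flag: $(cat /proc/sys/kernel/tainted)"

# GFS packages present?
rpm -qa 2>/dev/null | grep -i gfs || echo "no GFS packages found (or rpm unavailable)"

# firewall rules that might block cluster traffic
iptables -L -n 2>/dev/null || echo "iptables not readable (run as root)"
```

The remaining items (node-to-node pings, LUN visibility, clock skew within 5 minutes) need cluster-wide context and are better checked by hand.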
V. More information about Red Hat GFS and lock server configurations can be found at:
http://www.redhat.com/docs/manuals/csgfs/
http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-locl-gulm.html


[Last edited by tolywang on 2010-3-13 01:29]
