king3171 posted on 2008-12-28 22:38

Problem: a 10g RAC management process growing without limit

Environment: HP-UX 11.23, MC/ServiceGuard 11.16, Oracle 10.2.0.4 RAC
There are fewer than 10 users at present. On one node a process keeps growing until the node finally hangs and even telnet cannot connect; the other node stays normal.

Checking the affected node:
One process (PID 29921) had 15,577 open handles on the file hc_btsb1.dat, and the number kept growing. Running ps on the host shows that this PID belongs to racgimon, which is an Oracle system process. Node B has the corresponding process too, but it holds only 2 open handles on hc_btsb2.dat.
Oracle Clusterware uses the RAC Global Instance Monitor (racgimon) to check the availability of the instance on each node.
Excerpt from the lsof output:
...
27171        racgimon29921 oracle *152u   REG             64,0xb      1544   35182 /oracle/app/product/10.2.0/dbs/hc_btsb1.dat
27172        racgimon29921 oracle *153u   REG             64,0xb      1544   35182 /oracle/app/product/10.2.0/dbs/hc_btsb1.dat
27173        racgimon29921 oracle *154u   REG             64,0xb      1544   35182 /oracle/app/product/10.2.0/dbs/hc_btsb1.dat
27174        racgimon29921 oracle *155u   REG             64,0xb      1544   35182 /oracle/app/product/10.2.0/dbs/hc_btsb1.dat
27175        racgimon29921 oracle *156u   REG             64,0xb      1544   35182 /oracle/app/product/10.2.0/dbs/hc_btsb1.dat
...
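The lsof excerpt above can be summarized with a few lines of script. A minimal sketch, assuming lsof's default one-line-per-descriptor output; the sample lines are taken from this thread, and the file-name pattern `hc_*.dat` is the Oracle health-check file named above:

```python
import re

def count_open_handles(lsof_lines, name_pattern=r"hc_\w+\.dat"):
    """Count how many lsof output lines reference a file whose path
    matches name_pattern (here: the Oracle health-check file)."""
    pat = re.compile(name_pattern)
    return sum(1 for line in lsof_lines if pat.search(line))

# Sample lines modeled on the excerpt above (abbreviated columns):
sample = [
    "racgimon 29921 oracle *152u REG 64,0xb 1544 35182 /oracle/app/product/10.2.0/dbs/hc_btsb1.dat",
    "racgimon 29921 oracle *153u REG 64,0xb 1544 35182 /oracle/app/product/10.2.0/dbs/hc_btsb1.dat",
    "racgimon 29921 oracle    4u IPv4 0t0 TCP *:1521 (LISTEN)",
]
print(count_open_handles(sample))  # 2 of the 3 sample lines are hc_*.dat
```

In practice one would feed it `lsof -p 29921` output and watch the count grow over time, which is exactly the symptom described here.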

At the moment the node went down, the alert log and trace files kept repeating messages like these:
Process startup failed, error stack:
ORA-27300: OS system dependent operation:pipe failed with status: 23
ORA-27301: OS failure message: File table overflow
ORA-27302: failure occurred at: skgpspawn2
*** 2008-12-13 02:00:21.838
Cannot open alert file "/oracle/app/admin/btsb/bdump/alert_btsb1.log"; errno = 23
*** 2008-12-13 02:00:21.838
Process startup failed, error stack:
ORA-27300: OS system dependent operation:pipe failed with status: 23
ORA-27301: OS failure message: File table overflow
ORA-27302: failure occurred at: skgpspawn2
Cannot open alert file "/oracle/app/admin/btsb/bdump/alert_btsb1.log"; errno = 23
*** 2008-12-13 10:00:07.873

Searching Metalink for ORA-27300 turns up no explanation for "status: 23". Based on the repeated message Cannot open alert file "/oracle/app/admin/btsb/bdump/alert_btsb1.log"; errno = 23, my judgment is that the system had hit its open-file limit, so nothing more could be opened: the racgimon process keeps opening hc_btsb1.dat until it exhausts the system. How can this be solved? Certainly not by rebooting, which is only a temporary last resort.
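That judgment can be cross-checked against the errno tables. Errno numbering is platform-specific, but on SysV-derived systems such as HP-UX, and on Linux, errno 23 is ENFILE, the system-wide open-file table being full, which matches the "File table overflow" text in ORA-27301. A quick check (run here on Linux; the HP-UX value is an assumption based on the matching error text):

```python
import errno
import os

# errno 23 from the trace file: ENFILE means the kernel's system-wide
# file table is full -- consistent with "File table overflow" above.
print(errno.errorcode[23])           # 'ENFILE'
print(os.strerror(errno.ENFILE))     # platform-specific wording
```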

[ Last edited by king3171 at 2008-12-28 22:42 ]

king3171 posted on 2008-12-28 23:18

Found it: Base Bug 6931689
Problem statement:

RACGIMON HAS FILE HANDLE LEAK ON HEALTHCHECK FILE
--------------------------------------------------------------------------------


*** 07/04/08 04:56 am ***

PROBLEM:
--------
racgimon has a file handle leak on the health-check file.

At the customer's site, ServiceGuard detected a split brain, then a node was
bounced. At that time, "ORA-27301: OS failure message: File table overflow"
was recorded in alert.log. Also, "glance" showed that racgimon was holding
more than 26,000 file handles. The racgimon process had been started about
20 days earlier (14th Jun). Because of the handle leak in racgimon, the
operating system exhausted its kernel limit for maximum open files
("nfile" on HP-UX).

DIAGNOSTIC ANALYSIS:
--------------------
The handle leak by racgimon also reproduces in-house (Oracle Japan's test
environment).

First, I checked the handle leak with the glance command in-house. The
opened file was "$ORACLE_HOME/log/<NodeName>/racg/imon_<InstanceName>.log".
Roughly, the number of open handles increased by about 6 every 10 minutes.

  Open Files  PID: 27683, racgimon  PPID: 1  euid: 128  User: rac10204
                                                        Open    Open
   FD  File Name                                      Type   Mode  Count
  ----------------------------------------------------------------------
   32  <fifo,hfs,inode:4417096>                       fifo   read    2
   33  /fs02/.../product/10204db/dbs/hc_r10241.dat    reg    rd/wr   2
   34  /fs02/.../product/10204db/dbs/hc_r10241.dat    reg    rd/wr   2
   35  <fifo,hfs,inode:4417097>                       fifo   write   2
   36  <fifo,hfs,inode:4417098>                       fifo   read    1
   37  /fs02/.../product/10204db/dbs/hc_r10241.dat    reg    rd/wr   2
   38  /fs02/.../product/10204db/rdbms/mesg/ocius.msb reg    read    1
   39  inet,udp                                       socket rd/wr   1
   40  /fs02/.../product/10204db/dbs/hc_r10241.dat    reg    rd/wr   1
   41  <fifo,hfs,inode:4417507>                       fifo   write   1
   42  <fifo,hfs,inode:4417508>                       fifo   read    1
   43  /fs02/.../product/10204db/dbs/hc_r10241.dat    reg    rd/wr   1
   44  /fs02/.../product/10204db/dbs/hc_r10241.dat    reg    rd/wr   1
   45  /fs02/.../product/10204db/dbs/hc_r10241.dat    reg    rd/wr   1
   46  /fs02/.../product/10204db/dbs/hc_r10241.dat    reg    rd/wr   1
   47  /fs02/.../product/10204db/dbs/hc_r10241.dat    reg    rd/wr   1
   48  /fs02/.../product/10204db/dbs/hc_r10241.dat    reg    rd/wr   1
   ...
  169  /fs02/.../product/10204db/dbs/hc_r10241.dat    reg    rd/wr   1
  170  /fs02/.../product/10204db/dbs/hc_r10241.dat    reg    rd/wr   1
  171  /fs02/.../product/10204db/dbs/hc_r10241.dat    reg    rd/wr   1
   ...

During the handle leak, the racgimon log recorded the following error every
60 seconds (the health-check interval):

- imon_r1024.log

  2008-07-04 16:16:24.707:
  GIMH: GIM-00104: Health check failed to connect to instance.
  GIM-00090: OS-dependent operation:mmap failed with status: 12
  GIM-00091: OS failure message: Not enough space
  GIM-00092: OS failure occurred at: sskgmsmr_13

  (the same GIM lines repeat at 16:17:24.744 and 16:18:24.791)

The error recorded in imon_r1024.log above looks the same as Bug:6931689.
Bug:6989661, in turn, explains that a looping error in racgimon can result
in opened files not being closed. So my guess is that racgimon was looping
on the error from Bug:6931689, and that loop caused the handle leak.
Eventually it exceeded "nfile" on HP-UX, and ServiceGuard, Oracle, and other
applications could no longer run normally.

WORKAROUND:
-----------
kill racgimon sometimes.

RELATED BUGS:
-------------
Bug:6989661
Bug:6931689

REPRODUCIBILITY:
----------------
It reproduced both at the customer's site and in-house on 10.2.0.4 RAC.
It did not reproduce on 10.2.0.3 RAC in-house.

  Rep?    Platform           RDBMS    CRS
  ------- ------------------ -------- --------
  Y(100%) 179 HP-UX Itanium  10.2.0.4 10.2.0.4 (Customer's site)
  Y(100%) 179 HP-UX Itanium  10.2.0.4 10.2.0.4 (In-house)
  N(0%)   179 HP-UX Itanium  10.2.0.3 10.2.0.3 (Customer's site)

The customer states that the handle leak by racgimon was also seen on RAC
10.2.0.3 + MLR#10 Bug:6273339.

TEST CASE / STACK TRACE / 24-HOUR CONTACT / DIAL-IN / IMPACT DATE: n/a

SUPPORTING INFORMATION:
-----------------------
I think it is easier to provide access to the reproducing box (Oracle
Japan's environment) than to upload all the CRS logs. Otherwise I will
upload the CRS log, DB log, and syslog from the box.

*** 07/07/08 12:22 am ***

ADDITIONAL INFO
===============
The handle leak seems to occur on only one node, the one with the youngest
node number. For example, the customer reports the leak occurred on node2
when they ran the DB on nodes 2, 3, and 4, and on node1 when they ran the DB
on nodes 1, 2, 3, and 4.

Currently they watch racgimon on all nodes and kill racgimon when there is
an excessive handle leak. If this leak happens on only one node, please let
me know.

Yong Huang posted on 2008-12-29 01:19

I had the same file descriptor leak for the health-check file on Linux, running a 10.2.0.4 database with the 10.2.0.3 EM agent, which led Oracle to create Note 563575.1. The solution was to upgrade the agent to 10.2.0.4.

Your description of the problem is clear, and you posted back your own findings. Thanks.

Yong Huang

shahand posted on 2008-12-29 10:00

If all else fails, just mv the file away, but then you will have to monitor the database more closely yourself :)

owlstudio posted on 2008-12-29 10:19

Metalink is there to help DBAs solve problems; most common issues can be found on it.

king3171 posted on 2008-12-30 00:42

Originally posted by shahand at 2008-12-29 10:00:
If all else fails, just mv the file away, but then you will have to monitor the database more closely yourself :)

Moving that file away probably won't work, though I could arrange with the users to test it off-hours. But the workaround Metalink gives for this bug is just "kill racgimon sometimes." No fix patch has been released yet, and there is no word on when one will be. Frustrating. Also, my thread title is wrong: it should say that the RAC management process racgimon opens files without limit, not that the process count grows without limit.
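While the only published workaround is "kill racgimon sometimes," the watch-and-kill part can at least be automated. A minimal sketch of the decision logic only; the threshold, the use of `lsof -p`, and the helper names are assumptions for illustration, not from the bug text, and nothing is actually killed here:

```python
import subprocess

FD_THRESHOLD = 5000  # assumed limit, chosen well below the kernel's nfile


def racgimon_fd_count(pid):
    """Count open descriptors of a PID from `lsof -p` output
    (assumes lsof is installed; subtracts the header line)."""
    out = subprocess.run(["lsof", "-p", str(pid)],
                         capture_output=True, text=True).stdout
    return max(0, len(out.splitlines()) - 1)


def should_kill(fd_count, threshold=FD_THRESHOLD):
    """Decide whether a leaking racgimon has crossed the restart threshold."""
    return fd_count > threshold


# Decision logic applied to the counts seen in this thread:
print(should_kill(15577))  # leaking node from the first post
print(should_kill(2))      # healthy node B
```

A cron job built around this could restart racgimon before "nfile" is exhausted, rather than after the node hangs.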

Bug No.        7235094
Filed        04-JUL-2008        Updated        08-JUL-2008
Product        Oracle Server - Enterprise Edition        Product Version         10.2.0.4
Platform        HP-UX Itanium        Platform Version        No Data
Database Version        10.2.0.4        Affects Platforms         Port-Specific
Severity         Severe Loss of Service        Status        Duplicate Bug. To Filer
Base Bug        6931689
Fixed in Product Version        No Data

[ Last edited by king3171 at 2008-12-30 12:12 ]

king3171 posted on 2008-12-31 17:14

The patch has been found. Anyone running Oracle 10.2.0.4 RAC who hits the same problem can fix it with this patch:
Cause
====
The cause of this problem has been identified and verified in unpublished Bug 6931689. It is caused by an mmap error.

Bug 7235094 RACGIMON HAS FILE HANDLE LEAK ON HEALTHCHECK FILE
was closed as duplicate of unpublished Bug 6931689.

Solution
========
To fix this issue, please apply the following patch:
Patch 7298531 CRS MLR#2 ON TOP OF 10.2.0.4 FOR BUGS 6931689 7174111 6912026 7116314

Please apply the patch and let me know the status.

Article-ID: Note 739557.1
Title: File handles not released after upgrade to 10.2.0.3 CRS Bundle#2 or 10.2.0.4