查看: 388|回复: 0

[原创] 如何处理 ORACLE hang

[复制链接]
论坛徽章:
19
娜美
日期:2017-06-26 15:18:15火眼金睛
日期:2018-09-30 22:00:00目光如炬
日期:2018-09-16 22:00:01目光如炬
日期:2018-09-02 22:00:00火眼金睛
日期:2018-08-31 22:00:00目光如炬
日期:2018-07-29 22:00:00火眼金睛
日期:2018-04-30 22:00:00目光如炬
日期:2018-04-29 22:00:00目光如炬
日期:2018-04-22 22:00:00火眼金睛
日期:2018-02-28 22:00:00
发表于 2018-8-10 17:10 | 显示全部楼层 |阅读模式


如何处理 ORACLE hang  

作者简介:
----------------------------------------------------------------------
@ 孙显鹏,海天起点oracle技术专家,十年从业经验
@ 精通oracle内部原理,擅长调优和解决疑难问题
@ 致力于帮助客户解决生产中的问题,提高生产效率。
@ 爱好:书法,周易,中医。微信:sunyunyi_sun
@ 易曰:精义入神,以致用也!
@ 调优乃燮理阴阳何其难也!
----------------------------------------------------------------------


hanganalyze 用于当数据库出现hang或者严重性能问题时,对数据库阻塞进程和被阻塞进程进行分析,因为人工分析可能比较复杂,
那么借助hanganalyze工具可以快速发现数据库出现问题的根因,找出阻塞者,继而处理问题。另外oracle还提供了一个hangfg工具
包,自动收集hang文件包括systemstate,在RAC下也可以使用,非常方便,但是这个脚本需要稍加修改,linux下报错,主要是shell
路径问题,关于hangfg使用方法可以到mos上下载使用。建议还是不要使用,应该非常熟练hanganalyze和systemstate命令,这个是DBA
基本技能。要不然就会觉得很low对不对!


为什么数据库会出现hang呢?比如TM/TX锁,也就是表对象上面存在的锁以及死锁oracle会等待或者自动解除死锁,如果锁是
发生在内存结构上出现死锁,也就是循环等待资源或者长时间等待资源不能释放,这种类型的锁或死锁oracle是不能自动解除。
那么这种情况就会造成数据库HANG。有时候不是真正的hang可能只是性能问题,需要我们诊断。


oracle在出现hang或者急剧的性能问题会自动dump systemstate,systenstate就是SGA中对象的dump,也就是所有进程的dump。
(这个有点类似weblogic JAVA dump 文件,可以分析java的GC机制和OOM问题,IBM有一个非常好的OOM分析工具大家可以使用)
systemstate信息量较大人工分析困难,如果不熟练可能不好分析,可以借助ass.awk 工具进行分析,这个工具会把systemstate内容
分类整理成一个比较容易看懂的文件。使用方法很简单:awk -f ass.awk /home/oracle/yyjc1_ora_20938.trc 。
如果系统hang不能自动dump systemstate 那么我们可以借助操作系统debug 工具来完成systemstate:


首先需要在系统查找一个oracle 进程,最好不要使用后台进程,然后依据操作系统执行下面操作即可,
那么就会在oracle trace目录生成systemstate 文件,文件名为ora_PID.trc.

dbx debugger syntax example:
$ dbx -a PID
dbx() print ksudss(258)
dbx() detach
gdb debugger syntax example:
$ gdb $ORACLE_HOME/bin/oracle PID
print ksudss(258)

给一个linux下gdb 使用案例:
ps -ef |grep 9314
oracle    9314  9291  0 02:44 pts/0    00:00:00 sqlplus   as sysdba
oracle    9315  9314  0 02:44 ?        00:00:00 oracleprimary (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))) --找到sqlplus   as sysdba主进程
gdb /u01/app/oracle/product/11.2.0/db_1/bin/oracle 9315  --attache该进程
(gdb) print ksudss(258) --打印systemstate 信息
然后去 trace目录找trc文件。

awk -f ass.awk /home/oracle/yyjc1_ora_20938.trc 结果如下:
Blockers
~~~~~~~~

        Above is a list of all the processes. If they are waiting for a resource
        then it will be given in square brackets. Below is a summary of the
        waited upon resources, together with the holder of that resource.
        Notes:
        ~~~~~
         o A process id of '???' implies that the holder was not found in the
           systemstate.

                    Resource Holder State
    Rcache object=b7b796ed8,   325: waiting for 'cursor: pin S wait on X'
    Rcache object=b7b85a8b0,   460: waiting for 'cursor: pin S wait on X'
    Rcache object=b7b2edc78,   317: waiting for 'cursor: pin S wait on X'
    Rcache object=b7bdd7ed8,    81: waiting for 'cursor: pin S wait on X'
    Rcache object=b7ba62950,   311: waiting for 'cursor: pin S wait on X'
    Rcache object=b7bb68bc8,   410: waiting for 'cursor: pin S wait on X'
    Rcache object=b7ba637a0,   301: waiting for 'cursor: pin S wait on X'
    Rcache object=b7bddf8c0,   307: waiting for 'SGA: allocation forcing component growth'
    Rcache object=b7b1916d0,   329: waiting for 'cursor: pin S wait on X'
    Rcache object=a59b61d18,   345: waiting for 'cursor: pin S wait on X'
    Rcache object=b7bb8a8c0,   334: waiting for 'cursor: pin S wait on X'
    Rcache object=afc96ed00,   400: waiting for 'SGA: allocation forcing component growth'
    Rcache object=b7b2ed550,   504: waiting for 'cursor: pin S wait on X'
    Rcache object=b7be45b98,   614: waiting for 'cursor: pin S wait on X'
    Rcache object=b7befb3a8,   633: waiting for 'cursor: pin S wait on X'

Object Names
~~~~~~~~~~~~
Rcache object=b7b796ed8,                                      
Rcache object=b7b85a8b0,                                      
Rcache object=b7b2edc78,                                      
Rcache object=b7bdd7ed8,                                      
Rcache object=b7ba62950,                                      
Rcache object=b7bb68bc8,                                      
Rcache object=b7ba637a0,                                      
Rcache object=b7bddf8c0,                                      
Rcache object=b7b1916d0,                                      
Rcache object=a59b61d18,                                      
Rcache object=b7bb8a8c0,                                      
Rcache object=afc96ed00,                                      
Rcache object=b7b2ed550,                                      
Rcache object=b7be45b98,                                      
Rcache object=b7befb3a8,                                      

是不是非常清楚!


对于hanganalyze 如果真的hang不能登录那么就需要sqlplus "preliminary connection"
sqlplus -prelim '/ as sysdba',但是11.2.0.2以后的版本使用方法有所不同,下面会有介绍


那么结合hanganalyze 和 systemsate 就能判断出系统问题出在哪里,分析问题的目的就是避免问题再次出现。


下面介绍hanganalyze:

注意11.2.0.2 以后preliminary connection使用方法有变化:

NOTE: From 11.2.0.2 onwards, hanganalyze will not produce output under a sqlplus "preliminary connection"
since it requires a process state object and a session state object. If a hanganalyze is attempted, although
the hanganalyze will appear to be successful:

SQL>  oradebug hanganalyze 3
Statement processed.

the tracefile will contain the following output:

HANG ANALYSIS:

ERROR: Can not perform hang analysis dump without a process state object and a session state object.
( process=(nil), sess=(nil) )

As a workaround connect to an existing process id, for example PMON/SMON, before performing the dump :
sqlplus - prelim / as sysdba
oradebug setospid < use an existing process id, for example PMON/SMON>
oradebug hanganalyze 3

具体示例:

sqlplus -prelim '/ as sysdba'

SQL> ! ps -ef |grep pmon
oracle    4186     1  0 Aug09 ?        00:00:14 ora_pmon_primary
oracle    9149  9144  0 02:23 pts/1    00:00:00 /bin/bash -c  ps -ef |grep pmon
oracle    9151  9149  0 02:23 pts/1    00:00:00 grep pmon

SQL> oradebug setospid 4186
Oracle pid: 2, Unix process pid: 4186, image: oracle@piamarydb (PMON) --或者使用SMON
SQL> oradebug hanganalyze 3;
Statement processed.
SQL> oradebug tracefile_name;
/u01/app/oracle/diag/rdbms/primary/primary/trace/primary_pmon_4186.trc

单节点脚本:
Collection commands for Hanganalyze and Systemstate: Non-RAC:

Hanganalyze

sqlplus '/ as sysdba'
oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
-- Wait one minute before getting the second hanganalyze
oradebug hanganalyze 3
oradebug tracefile_name
exit
Systemstate

sqlplus '/ as sysdba'
oradebug setmypid
oradebug unlimit
oradebug dump systemstate 258
oradebug dump systemstate 258
oradebug tracefile_name
exit

RAC 脚本:
Collection commands for Hanganalyze and Systemstate: RAC
There are 2 bugs affecting RAC that without the relevant patches being applied on your system, make using level 266 or 267 very costly.
Therefore without these fixes in place it highly inadvisable to use these level

For information on these patches see:

Document 11800959.8 Bug 11800959 - A SYSTEMSTATE dump with level >= 10 in RAC dumps huge BUSY GLOBAL CACHE ELEMENTS - can hang/crash instances
Document 11827088.8 Bug 11827088 - Latch 'gc element' contention, LMHB terminates the instance

Note: both of these bugs are resolved in Oracle 11g Release 2 version 11.2.0.3.


For 11g (and above):
sqlplus '/ as sysdba'
oradebug setmypid
oradebug  unlimit
oradebug -g all hanganalyze 3
oradebug -g all hanganalyze 3
oradebug -g all dump systemstate 258
oradebug -g all dump systemstate 258
exit

For 10g, run as following:

sqlplus '/ as sysdba'
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 3
oradebug -g all hanganalyze 3
oradebug -g all dump systemstate 258
oradebug -g all dump systemstate 258
exit
In RAC environment, a dump will be created for all RAC instances in the DIAG trace file for each instance.




下面是对各个状态的说明:

The following describes the important states to be considered:


IN_HANG: This might be considered as the most critical STATE. Basically a node in this state is involved in a deadlock, or is hung.
Usually there will be another “adjacent node” in the same status. For example:

[nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor
[16]/0/17/154/0x24617be0/26800/IN_HANG/29/32/[185]/19
[185]/1/16/4966/0x24617270//IN_HANG/30/31/[16]/16
In this example the node [16] is waiting for node [185], and the other way around; this is  a cyclical condition (deadlock).


LEAF and LEAF_NW: Leaf nodes are considered on top of the wait chain (usually blockers). They are considered “Blockers”
when there is another session waiting. This can be easily identified using the “predecesor” field. If there is a node referenced in the ‘prdecessor’ field, the

node
is considered as “blocker”, otherwise it is considered as a “slow” session waiting for some resource.
The difference between LEAF and LEAF_NW is that LEAF nodes are waiting for something, while LEAF_NW are not waiting or may be
using the CPU. A typical representation of these nodes when they are considered blockers is:
[ nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor
[16]/0/17/154/0x24617be0/26800/LEAF/29/30//19
[19]/0/20/13/0x24619830/26791/NLEAF/33/34/[16]/186

In this example, node [16] is blocking node [19]. Notice that node [16] has node [19] in the predecessor field.
Also notice that node [19] has node [16] in the adjacent list.


NLEAF : These sessions are usually considered as “stuck” sessions. It means that there is another session holding a resource
needed by the session in this state. By using the adjlist, you can determine which node is the blocker of this process. When many sessions are found in this state,
it is likely the database is experiencing a performance problem rather than a hang problem.



IGN and IGN_DMP : Sessions in this state are usually considered as IDLE sessions, unless they reference a node in the “ adjlist” field.
In this case, the node is waiting for another node, so it will be considered as a ‘stuck’ session as well.
Extending the previous example,
[nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor
[16]/0/17/154/0x24617be0/26800/LEAF/29/30//19
[19]/0/20/13/0x24619830/26791/NLEAF/33/34/[16]/186
[189]/1/20/36/0x24619830//IGN/95/96/[19]/none
[176]/1/7/1/0x24611d80//IGN/75/76//none

You may notice that node [189] is waiting for node [19] which in turn is waiting for node [16], while node [176] is an IDLE session .
This maybe the case when a session has a DML lock but never finished the transaction.

In Oracle9i, two new states were introduced to differentiate between LEAF nodes that have other nodes waiting behind them
(i.e., LEAF nodes that are blockers) vs. LEAF nodes that are not affecting other nodes.


SINGLE_NODE and SINGLE_NODE_NW:
This can be described the same as LEAF and LEAF_NW, except that they don't have processes depending on them.










您需要登录后才可以回帖 登录 | 注册

本版积分规则 发表回复

TOP技术积分榜 社区积分榜 徽章 团队 统计 知识索引树 积分竞拍 文本模式 帮助
  ITPUB首页 | ITPUB论坛 | 数据库技术 | 企业信息化 | 开发技术 | 微软技术 | 软件工程与项目管理 | IBM技术园地 | 行业纵向讨论 | IT招聘 | IT文档
  ChinaUnix | ChinaUnix博客 | ChinaUnix论坛
CopyRight 1999-2011 itpub.net All Right Reserved. 北京盛拓优讯信息技术有限公司版权所有 联系我们 
京ICP备16024965号 北京市公安局海淀分局网监中心备案编号:11010802021510 广播电视节目制作经营许可证:编号(京)字第1149号
  
快速回复 返回顶部 返回列表