Thread starter: owen

An account of a database that would not start, for reference

11#
Original poster | Posted 2001-12-7 21:04

A real expert! To overtime:

Where does the explanation in the post above come from? I still have some questions. Could we talk directly?  010-62200868-3309
      13951670930

12#
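Posted 2001-12-7 23:16
owen, although many operating systems claim they can run 24x7 forever, a system does need to be rebooted periodically, and doing so saves you a lot of trouble. A fairly stable system can go one to four months between reboots. If you have gone half a year without a reboot, a problem can occur at any moment, and in many cases you will not be able to find the cause.
I once had a data warehouse system running on NT that, for a while, died about once a month on average. Worse, that machine was also the PDC. After three months in a row of this I had had enough and went to argue with the head of the server group. It is now rebooted once a week and has had no problems for more than a year.
As for the shutdown hang: if you kill all user sessions before issuing shutdown immediate, it generally does not hang for long. Even doing a shutdown abort, then startup again, then shutdown immediate is no big deal. I have never seen a database fail to start after a shutdown abort. If you have, tell me about it.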

13#
Posted 2001-12-7 23:39
This is an early notification that Sun published in August of last year. It can be found on SunSolve, and it should be on Oracle's MetaLink as well.

14#
Posted 2001-12-8 00:15
Versions Affected
~~~~~~~~~~~~~~~~~
  Oracle Server and Oracle Enterprise Server 7.X through 8.1.6

Platforms Affected
~~~~~~~~~~~~~~~~~~
  UNIX GENERIC

Description
~~~~~~~~~~~
  Due to Oracle <Bug:1084273> a timer overflow will cause background
  processes to loop indefinitely.  A timer overflow occurs when the number
  of clock ticks exceeds the largest positive value that the datatype
  storing it can represent.

Likelihood of Occurrence
~~~~~~~~~~~~~~~~~~~~~~~~
  Operating systems with clock ticks set to milliseconds will see this
  problem after 24.8 days but typically systems have clock ticks set
  to centiseconds which means the problem would not be seen for 248 days.  
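
The 24.8-day and 248-day figures follow from a signed 32-bit tick counter, which can hold at most 2147483647 ticks. A minimal Python sketch of that arithmetic is shown here; the function name `days_until_overflow` is only illustrative, and the 32-bit width is an assumption taken from the 2147483647 limit quoted elsewhere in this note.

```python
SECONDS_PER_DAY = 86_400
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit counter can hold


def days_until_overflow(clk_tck: int) -> float:
    """Days of uptime before a signed 32-bit tick counter overflows.

    clk_tck is the number of clock ticks per second (1000 for
    millisecond ticks, 100 for centisecond ticks).
    """
    if clk_tck <= 0:
        raise ValueError("clk_tck must be positive")
    return INT32_MAX / clk_tck / SECONDS_PER_DAY


if __name__ == "__main__":
    for clk_tck in (1000, 100):
        print(f"{clk_tck} ticks/s: {days_until_overflow(clk_tck):.2f} days")
```

The results are roughly 24.86 and 248.55 days, which the note quotes as 24.8 and 248.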

Possible Symptoms
~~~~~~~~~~~~~~~~~
  After the number of clock ticks since the machine was last rebooted
  overflows you will not be able to shut down or start up the affected
  Oracle RDBMS products.  If your operating system provides a system call
  trace such as Sun's truss utility you can check for the behavior.
  
    % truss -af -o <output_file> -p <pid_of_pmon_process>

  If you see output similar to the following you are looping and may be
  experiencing this problem:

   24369:  semop(720897, 0xEFFFE7A0, 1)    (sleeping...)
   24448:      Received signal #14, SIGALRM, in semop() [caught]
   24448:  semop(720897, 0xEFFFDAA8, 1)                    Err#91 ERESTART
   24448:  sigprocmask(SIG_BLOCK, 0xEFFFD6C0, 0x00000000)  = 0
   24448:  times(0xEFFFD650)                               = -2117821797
   24448:  setitimer(ITIMER_REAL, 0xEFFFD650, 0x00000000)  = 0
   24448:  sigprocmask(SIG_UNBLOCK, 0xEFFFD6C0, 0x00000000) = 0
   24448:  setcontext(0xEFFFD790)
   24377:  semop(720897, 0xEFFFE7A0, 1)    (sleeping...)
   24398:      Received signal #14, SIGALRM, in semop() [caught]
   24398:  semop(720897, 0xEFFFE7A0, 1)                    Err#91 ERESTART
   24398:  sigprocmask(SIG_BLOCK, 0xEFFFE3B8, 0x00000000)  = 0
   24398:  times(0xEFFFE348)                               = -2117821686

  Note that the "times" call is returning a large negative value.
  It is also important to understand that the negative values returned
  by "times" are not themselves the problem; the issue is how the Oracle
  timer checks against them.  It is normal to see negative values
  returned by "times" on systems that have been up over 248 days.
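
To make the negative values in the trace easier to read, the sketch below reinterprets a negative times() return value as an unsigned 32-bit tick count and reports how far past the wrap point it is. It assumes a 32-bit counter that has wrapped at most once; the helper names are illustrative and are not part of any Oracle or Solaris tool.

```python
WRAP_POINT = 2**31  # first tick count that no longer fits in a signed 32-bit value
MODULUS = 2**32     # number of distinct values in a 32-bit counter


def ticks_since_boot(times_value: int) -> int:
    """Map a signed 32-bit times() value back to ticks since boot.

    Assumes the counter has wrapped at most once.
    """
    return times_value % MODULUS


def ticks_past_wrap(times_value: int) -> int:
    """Ticks elapsed since the counter went negative; 0 if it has not wrapped."""
    return max(0, ticks_since_boot(times_value) - WRAP_POINT)


if __name__ == "__main__":
    observed = -2117821797  # first times() value in the truss output
    print(ticks_since_boot(observed), ticks_past_wrap(observed))
```

For the first value in the trace this gives 2177145499 ticks since boot, which is 29661851 ticks past the wrap point. At 100 ticks per second that would be about 3.4 days after the counter went negative; the trace itself does not state the tick rate.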

  Any running instances will need to be aborted with a "shutdown abort"
  before the system is shut down.


Questions & Answers
~~~~~~~~~~~~~~~~~~~~~
Q.  Is this bug really generic? Which platforms are known to be affected
by it?
A.  So far only Solaris is confirmed, and NCR is suspected. Other
platforms have not been verified.

Q.  Does this problem affect Oracle7?
A.  This problem does not affect Oracle7 on Sun Solaris. Other platforms
have not been verified.

Q.  How do I know whether my system will be impacted 24 days or 248 days
after reboot?
A.  On Solaris, the command "/usr/bin/getconf CLK_TCK" will return the
number of clock ticks per second. If the number is 1000, the system is
impacted 24 days after reboot. If the number is 100, the system is
impacted 248 days after reboot.
Another way to determine the clock ticks of your system is to use the
command:
"truss -tsysconfig time true"
On Solaris this will show:
sysconfig(_CONFIG_CLK_TCK)                      = 1000
if your system clock ticks 1000 times per second.

Q.  How do I know whether my system is close to being impacted?
A.  Use the uptime command to determine how long the system has been
running. Based on the number of clock ticks per second, you will be able
to tell when the system will be impacted. Alternatively, you can see the
current return value of the times system call using this command:
"truss -ttimes time true"
On Solaris this will show:
times(0xEFFFFC20)                               = 962090
This system was rebooted 962.09 seconds ago. After the return value of
the times() system call reaches 2147483647, it will wrap and become
-2147483648. When times() returns a negative value, your system has been
impacted.
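
As a rough illustration of this answer, the following Python sketch turns a times() return value and a CLK_TCK setting into the current uptime and the time left before the wrap. The function names are hypothetical, and the inputs are meant to be read off the truss and getconf output by hand.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the last value before times() wraps
SECONDS_PER_DAY = 86_400


def uptime_seconds(times_value: int, clk_tck: int) -> float:
    """Seconds since reboot for a non-negative times() return value."""
    return times_value / clk_tck


def days_until_wrap(times_value: int, clk_tck: int) -> float:
    """Days left before times() wraps; 0.0 if it is already negative."""
    if times_value < 0:
        return 0.0  # already past the impact point
    return (INT32_MAX - times_value) / clk_tck / SECONDS_PER_DAY


if __name__ == "__main__":
    # Example values from the answer: 962090 ticks at 1000 ticks per second.
    print(uptime_seconds(962090, 1000))
    print(round(days_until_wrap(962090, 1000), 2))
```

With the example values from the answer, this reports 962.09 seconds of uptime and a little under 24.85 days remaining. On a Unix host, Python's `os.sysconf("SC_CLK_TCK")` returns the same ticks-per-second figure that `getconf CLK_TCK` prints.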

Q.  How do I know that the patch fixed the problem?
A.  When your system has passed the impact date, without the patch
installed, shutdown abort a test database on the system, and try to
start it up. If the startup hangs during the mount phase, you have
encountered the problem. Install the patch on the test database, and
retry the startup. It should now start up fine.

Workaround
~~~~~~~~~~
  The fix for this issue is incorporated into the 8.1.7 and newer
  releases of Oracle Server and Oracle Enterprise Server.

  If a patch is not available for your operating system or Oracle
  version (see the patch list in the next section), the workaround
  is to do the following:

    1) Stop all running databases on the server

    2) Reboot the server

  This will reset the timer and start the number of ticks back to "0".

Patches
~~~~~~~
  As of August 28, 2000 there are two patches available for Sun Solaris
  32 bit:

   8.0.6: <BUG:1265297>

     @tcpatch:/u01/patch/SUN_SOLARIS2/8.0.6.0.0/bug1265297

   8.1.6: <BUG:1227119>

     @tcpatch:/u01/patch/SUN_SOLARIS2/8.1.6.0.0/bug1227119

References
~~~~~~~~~~
  DATABASE HANGS AFTER 24 DAYS, LOOPING ON SEMOP CALL <Bug:1084273>
  @ DO NOT USE TIMES() RETURN VALUE                   <Bug:1185824>
