|
描述:
db2 connect to <dbname> 无法连接数据库
背景介绍:
aix平台
系统原先运行良好,而后用户从fixpak 7b升级到fixpak 15。
在box中有3个实例,其中一个实例在db2iupdt后无法连接数据库。实例中仅有一个数据库。
问题诊断:
哈哈,这种问题来了以后第一个要做得是什么?
cd ~/sqllib/db2dump
mv db2diag.log db2diag.log.bak
<reproduce problem>
vi db2diag.log
得到下面的东西:
2007-12-12-20.42.45.029130-360 I1838C472 LEVEL: Warning
PID : 1593388 TID : 1 PROC : db2agent (U88FLAQ) 0
INSTANCE: iu88flaq NODE : 000 DB : U88FLAQ
APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:15
RETCODE : ZRC=0x820F0004=-2112946172=SQLO_MEM_SIZE "Mem Mgt invalid size"
DIA8563C An invalid memory size was requested.
2007-12-12-20.42.45.029814-360 I2311C755 LEVEL: Warning
PID : 1593388 TID : 1 PROC : db2agent (U88FLAQ) 0
INSTANCE: iu88flaq NODE : 000 DB : U88FLAQ
APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:16
DATA #1 : String, 297 bytes
Failed to allocate the desired database shared memory set.
Check to make sure the configured DATABASE_MEMORY + overflow
does not exceed the maximum shared memory on the system. Will
attempt to allocate the minimum possible db shared memory size.
Desired database shared memory set size is (bytes):
DATA #2 : Hexdump, 4 bytes
0x2FF13E20 : C601 4000 ..@.
2007-12-12-20.42.45.030148-360 I3067C491 LEVEL: Severe
PID : 1593388 TID : 1 PROC : db2agent (U88FLAQ) 0
INSTANCE: iu88flaq NODE : 000 DB : U88FLAQ
APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:17
RETCODE : ZRC=0x850F0005=-2062614523=SQLO_NOSEG
"No Storage Available for allocation"
DIA8305C Memory allocation failure occurred.
2007-12-12-20.42.45.030341-360 I3559C646 LEVEL: Severe
PID : 1593388 TID : 1 PROC : db2agent (U88FLAQ) 0
INSTANCE: iu88flaq NODE : 000 DB : U88FLAQ
APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:18
DATA #1 : String, 189 bytes
Failed to allocate the minimum possible database shared memory set.
Will now attempt to start up with a single small bufferpool.
Minimum possible database shared memory set size is (bytes):
DATA #2 : Hexdump, 4 bytes
0x2FF13E1C : ACC0 CCCC ....
2007-12-12-20.42.45.031566-360 I4206C472 LEVEL: Warning
PID : 1593388 TID : 1 PROC : db2agent (U88FLAQ) 0
INSTANCE: iu88flaq NODE : 000 DB : U88FLAQ
APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:15
RETCODE : ZRC=0x820F0004=-2112946172=SQLO_MEM_SIZE "Mem Mgt invalid size"
DIA8563C An invalid memory size was requested.
2007-12-12-20.42.45.031762-360 I4679C755 LEVEL: Warning
PID : 1593388 TID : 1 PROC : db2agent (U88FLAQ) 0
INSTANCE: iu88flaq NODE : 000 DB : U88FLAQ
APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:16
DATA #1 : String, 297 bytes
Failed to allocate the desired database shared memory set.
Check to make sure the configured DATABASE_MEMORY + overflow
does not exceed the maximum shared memory on the system. Will
attempt to allocate the minimum possible db shared memory size.
Desired database shared memory set size is (bytes):
DATA #2 : Hexdump, 4 bytes
0x2FF13E20 : BEBE 0000 ....
2007-12-12-20.42.45.032023-360 I5435C491 LEVEL: Severe
PID : 1593388 TID : 1 PROC : db2agent (U88FLAQ) 0
INSTANCE: iu88flaq NODE : 000 DB : U88FLAQ
APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:17
RETCODE : ZRC=0x850F0005=-2062614523=SQLO_NOSEG
"No Storage Available for allocation"
DIA8305C Memory allocation failure occurred.
2007-12-12-20.42.45.032214-360 I5927C646 LEVEL: Severe
PID : 1593388 TID : 1 PROC : db2agent (U88FLAQ) 0
INSTANCE: iu88flaq NODE : 000 DB : U88FLAQ
APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:18
DATA #1 : String, 189 bytes
Failed to allocate the minimum possible database shared memory set.
Will now attempt to start up with a single small bufferpool.
Minimum possible database shared memory set size is (bytes):
DATA #2 : Hexdump, 4 bytes
0x2FF13E1C : A665 D333 .e.3
2007-12-12-20.42.45.032412-360 I6574C491 LEVEL: Severe
PID : 1593388 TID : 1 PROC : db2agent (U88FLAQ) 0
INSTANCE: iu88flaq NODE : 000 DB : U88FLAQ
APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:20
RETCODE : ZRC=0x850F0005=-2062614523=SQLO_NOSEG
"No Storage Available for allocation"
DIA8305C Memory allocation failure occurred.
相信大家应该已经知道发生什么问题了吧
啥?不知道?给个提示,下面的部分说明了什么咚咚?
2007-12-12-20.42.45.029814-360 I2311C755 LEVEL: Warning
PID : 1593388 TID : 1 PROC : db2agent (U88FLAQ) 0
INSTANCE: iu88flaq NODE : 000 DB : U88FLAQ
APPHDL : 0-7 APPID: *LOCAL.iu88flaq.071213024245
FUNCTION: DB2 UDB, base sys utilities, sqleFirstConnect, probe:16
DATA #1 : String, 297 bytes
Failed to allocate the desired database shared memory set.
Check to make sure the configured DATABASE_MEMORY + overflow
does not exceed the maximum shared memory on the system. Will
attempt to allocate the minimum possible db shared memory size.
Desired database shared memory set size is (bytes):
DATA #2 : Hexdump, 4 bytes
0x2FF13E20 : C601 4000 ..@.
尝试分配0xC6014000的内存失败!为啥?看到了SQLO_MEM_SIZE和SQLO_NOSEG了,而且size_t是32位的,还用继续分析么?
直接db2level,一看果然是32位实例,立刻仰天长笑三声,让用户db2iupdt -w 64一遍这个问题不就搞定了……
可是敌人是顽固疯狂的,是无处不在的!用户db2iupdt以后还是弱弱地问:为啥俺还是连不上捏?
寒,不是吧,怎么还连不上?
把第一招再次祭出来,抓新的db2diag.log。
这一次错误信息果然换了一个:
2007-12-14-12.15.17.656977-360 I1A1093 LEVEL: Event
PID : 1052786 TID : 1 PROC : db2
INSTANCE: iu88flaq NODE : 000
FUNCTION: DB2 UDB, RAS/PD component, _pdlogInt, probe:120
START : New db2diag.log file
DATA #1 : Build Level, 144 bytes
Instance "iu88flaq" uses "64" bits and DB2 code release "SQL08028"
with level identifier "03090106".
Informational tokens are "DB2 v8.1.1.136", "s070720", "U810952", FixPak "15".
DATA #2 : System Info, 224 bytes
System: AIX cosuadb01 3 5 00027F15D600
CPU: total:8 online:8 Threading degree per core:2
Physical Memory(MB): total:7936 free:1527
Virtual Memory(MB): total:16128 free:7967
Swap Memory(MB): total:8192 free:6440
Kernel Params: msgMaxMessageSize:4194304 msgMaxQueueSize:4194304
shmMax:68719476736 shmMin:1 shmIDs:131072
shmSegments:68719476736 semIDs:131072 semNumPerID:65535
semOps:1024 semMaxVal:32767 semAdjustOnExit:16384
Information in this record is only valid at the time when this file was
created (see this record's time stamp)
2007-12-14-12.15.17.655334-360 I1095A383 LEVEL: Error
PID : 1052786 TID : 1 PROC : db2
INSTANCE: iu88flaq NODE : 000
FUNCTION: DB2 UDB, command line process, clp_start_bp, probe:3
MESSAGE : CLP frontend unable to get REQUEST queue handle
DATA #1 : Hexdump, 4 bytes
0x0FFFFFFFFFFFD6C8 : 870F 0042 ...B
2007-12-14-12.15.45.546601-360 I1479A383 LEVEL: Error
PID : 2765134 TID : 1 PROC : db2
INSTANCE: iu88flaq NODE : 000
FUNCTION: DB2 UDB, command line process, clp_start_bp, probe:3
MESSAGE : CLP frontend unable to get REQUEST queue handle
DATA #1 : Hexdump, 4 bytes
0x0FFFFFFFFFFFD688 : 870F 0042 ...B
看看870F0042是啥:
Input ZRC string '0x870f0042' parsed as 0x870F0042 (-2029060030).
ZRC value to map: 0x870F0042 (-2029060030)
V7 Equivalent ZRC value: 0xFFFFF642 (-2494)
ZRC class :
Global Processing Error (Class Index: 7)
Component:
SQLO ; oper system services (Component Index: 15)
Reason Code:
66 (0x0042)
Identifer:
SQLO_QUE_NOT_EXIST
Identifer (without component):
SQLZ_RC_QNOEXS
Description:
Queue does not exist
Associated information:
Sqlcode -902
SQL0902C A system error (reason code = "" occurred. Subsequent SQL
statements cannot be processed.
Number of sqlca tokens : 1
Diaglog message number: 8558
告诉我们queue doesn't exist! 怎么办?
首先我们要理解一些这个queue doesn't exist是什么咚咚。
当我们用db2 xxxx命令作事情的时候,这个db2是一个可执行程序。这个可执行程序被称作frontend process。当db2这个进程执行的时候,会隐式地在后台生成一个db2bp的进程。这个进程叫做db2 backend process。所以说,当我们用db2 connect to <db>这个命令连接数据库的时候,真正与数据库相连接的是其后台的db2bp进程。而当连接成功以后,后面所有的比如select操作什么的,都是通过queue这种IPC手段与db2bp进程进行通讯的。
这里我们得到的错误信息就是说,当db2 frontend process想找backend process的时候,发现用来通讯的queue不存在!
我们知道既然是IPC通讯,那frontend process就不可能只是得到一个queue不存在就立刻报错。所以问一下用户:您老得到错误的时候是在issue了connect命令后立刻得到的还是等了一阵呀。用户也弱弱地答:当然是等了一阵咩~~~
哼哼,现在该做什么了?当然是要看看这个后台进程正在做什么了,竟然让前台进程等了半天也拿不到queue,人家不得不退出说queue不存在了……
怎么看?还用问?db2trc for the win!
因此祭起第二招法宝
db2trc on -t -f db2trc.dmp
<reproduce problem>
db2trc off
db2trc flw -t db2trc.dmp db2trc.flw
db2trc fmt db2trc.dmp db2trc.fmt
出来的flw文件不大,就40多M,大概51万行
(比动辄上G的其他复杂问题小多了)
别被这51万吓着,让我全看?没那闲情雅趣。所以说看trace最关键的就是要知道,你现在正在干什么,想要看什么东西。
我们现在正在调查backend process为啥不创建queue,首先让我们从得到的db2diag.log入手。
先到fmt文件里找关键字"2007-"(不包括引号)
(问我为啥找这个关键字?那您觉得用什么关键字才能够最快地找到dump到db2diag.log文件里的内容呢?)
只得到了一个entry:
2170 data DB2 UDB command line process clp_start_bp fnc (3.3.41.7.0.3)
pid 2765134 tid 1 cpid -1 node -1 sec 64 nsec 869465587 probe 3
bytes 392
Data1 (PD_TYPE_DIAG_LOG_REC,384) Diagnostic log record:
2007-12-14-12.15.45.546601-360 I1479A383 LEVEL: Error
PID : 2765134 TID : 1 PROC : db2
INSTANCE: iu88flaq NODE : 000
FUNCTION: DB2 UDB, command line process, clp_start_bp, probe:3
MESSAGE : CLP frontend unable to get REQUEST queue handle
DATA #1 : Hexdump, 4 bytes
0x0FFFFFFFFFFFD688 : 870F 0042 ...B
看到没?2170对应着flw文件里的2170, 然后再这个entry里我们看到clp_start_bp,顾名思义,就是clp用来开启backend process的function call咯,也就是说,这个2170是在frontend process中的。
然后回到flw文件,找2170:
2166 64:869290682 | | | | | | | sqlochgfileptr exit
2167 64:869405455 | | | | | | _pdlogInt data [probe 90]
2168 64:869450954 | | | | | | | sqlowrite entry
2169 64:869464745 | | | | | | | sqlowrite exit
2170 64:869465587 | | | clp_start_bp data [probe 3]
2171 64:869466392 | | | | | | | sqloclose entry
2172 64:869472755 | | | | | | | sqloclose exit
2173 64:869479918 | | | | | | _pdlogInt exit
2174 64:869480276 | | | | | pdLog exit
好家伙,64秒才跑到2170,往上看,看看它一直在干啥:
吼吼,我们看到了好多好多OpenMLNQue啊,而且是基本一秒钟一个
1437 22:867813790 | | | | | sqloOpenMLNQue data [probe 1]
1438 22:867817987 | | | | | | sqlogkey entry
1439 22:867818164 | | | | | | sqlogkey data [probe 1]
1440 22:867819491 | | | | | | sqlogkey exit [rc = 0x6F02C86F = 1862453359]
1441 22:867823317 | | | | | sqloOpenMLNQue exit [rc = 0x870F0042 = -2029060030 = SQLO_QUE_NOT_EXIST]
1442 22:867823675 | | | | | sqlorest entry
1443 22:867823869 | | | | | sqlorest data [probe 10]
1444 23:867842081 | | | | | sqlorest exit
1445 23:867845317 | | | | | sqloOpenMLNQue entry
1446 23:867847142 | | | | | sqloOpenMLNQue data [probe 1]
1447 23:867850175 | | | | | | sqlogkey entry
1448 23:867850525 | | | | | | sqlogkey data [probe 1]
1449 23:867852012 | | | | | | sqlogkey exit [rc = 0x6F02C86F = 1862453359]
1450 23:867855105 | | | | | sqloOpenMLNQue exit [rc = 0x870F0042 = -2029060030 = SQLO_QUE_NOT_EXIST]
1451 23:867855472 | | | | | sqlorest entry
1452 23:867855657 | | | | | sqlorest data [probe 10]
1453 24:867872812 | | | | | sqlorest exit
1454 24:867876195 | | | | | sqloOpenMLNQue entry
1455 24:867877582 | | | | | sqloOpenMLNQue data [probe 1]
1456 24:867880910 | | | | | | sqlogkey entry
1457 24:867881256 | | | | | | sqlogkey data [probe 1]
1458 24:867882545 | | | | | | sqlogkey exit [rc = 0x6F02C86F = 1862453359]
1459 24:867885558 | | | | | sqloOpenMLNQue exit [rc = 0x870F0042 = -2029060030 = SQLO_QUE_NOT_EXIST]
说明了啥?frontend process每秒钟检测一次backend process queue是不是avaliable,然后等了一分钟左右发现一直找不到,然后就timeout退出拉。
不过现在还是没有说backend process在这段时间干了些什么呀~~~怎么办?
怎么办?既然我们知道了这个process是frontend process,那么backend process肯定在另外的地方咯。而且想要知道backend process在干什么,就要找这64秒之前的东西,像那些100多秒以后的事情就可以忽略不计了。
但是这里可是有51万行呢,怎么找?我们知道一个程序的入口都是main,所以我们来搜索一下关键字main,得到下面的东西:
除了第一个clp_bp_main,其他的都是100秒之后的事情,不用管,去第一个看。
从名字上听起来,clp_bp_main不就是CommandLineProcess_BackendProcess_main么,那也就是后台进程的入口咯。
pid = 3002686 tid = 1 node = 0
804 4:867219779 sqloDirectStandardFileDescriptorsToDevNull entry
810 4:867250656 | sqloAddOneReservedHandle entry
811 4:867265542 | sqloAddOneReservedHandle data [probe 10]
812 4:867266195 | sqloAddOneReservedHandle exit
813 4:867266457 | sqloAddOneReservedHandle entry
814 4:867266882 | sqloAddOneReservedHandle data [probe 10]
815 4:867267114 | sqloAddOneReservedHandle exit
817 4:867267312 | sqloAddOneReservedHandle entry
819 4:867267556 | sqloAddOneReservedHandle data [probe 10]
820 4:867268020 | sqloAddOneReservedHandle exit
821 4:867268268 sqloDirectStandardFileDescriptorsToDevNull exit
826 4:873512436 clp_bp_main entry
827 4:873535442 | sqleInitApplicationEnvironment entry
828 4:873548871 | | sqloGetEnvInternal entry
829 4:873557214 | | | EnvRegGetProfile entry
830 4:873558276 | | | EnvRegGetProfile exit
831 4:873558815 | | | EnvPrfGetValueByEnumIndex entry
832 4:873559169 | | | EnvPrfGetValueByEnumIndex exit [rc = 0x870F0104 = -2029059836 = RC_ENV_NOT_FOUND]
833 4:873559409 | | | EnvRegGetProfile entry
834 4:873559809 | | | EnvRegGetProfile exit
835 4:873560012 | | | EnvPrfGetValueByEnumIndex entry
836 4:873560306 | | | EnvPrfGetValueByEnumIndex exit [rc = 0x870F0104 = -2029059836 = RC_ENV_NOT_FOUND]
837 4:873560884 | | sqloGetEnvInternal exit [rc = 0x870F0104 = -2029059836 = RC_ENV_NOT_FOUND]
838 4:873569808 | | sqloxltc_app entry
839 4:873571249 | | sqloxltc_app exit
840 4:873571666 | | sqleBeginTypedCtxInternal entry
而那个对应的进程id 3002686就应该是其backend process的pid了
知道了这个,现在我们能看出,用户发出db2 connect to <dbname>命令是在trace开启后4。8秒的时候,然后继续往下看,看看bp都在干什么。
往下卷了大概不到1000行,我们突然发现了几个时间戳的 big jump
1235 4:893112670 | | | | sqloPdbTcpipGetFullHostName entry
1236 4:893114440 | | | | | sqloPdbTcpIpResolveHostName entry
1237 4:893114912 | | | | | | sqloPdbGetHostByName entry
1238 4:893115316 | | | | | | sqloPdbGetHostByName data [probe 7]
1462 25:10592207 | | | | | | sqloPdbGetHostByName data [probe 25]
1463 25:10594709 | | | | | | sqloPdbGetHostByName exit
1464 25:10595152 | | | | | sqloPdbTcpIpResolveHostName exit
1465 25:10596028 | | | | sqloPdbTcpipGetFullHostName exit
1466 25:10599298 | | | | sqlogmblkEx entry
1467 25:10628001 | | | | sqlogmblkEx mbt [Marker:PD_OSS_ALLOCATED_MEMORY ]
1468 25:10629198 | | | | sqlogmblkEx exit
1469 25:10630862 | | | | sqloGetNextNodeList entry
1470 25:10631216 | | | | | sqloStreamFileGetString entry
1471 25:10631679 | | | | | | sqlovsfh entry
1472 25:10631991 | | | | | | sqlovsfh exit
1473 25:10634789 | | | | | | sqloread entry
1474 25:10649979 | | | | | | sqloread exit [rc = 0x870F0009 = -2029060087 = SQLO_EOF]
1475 25:10652511 | | | | | sqloStreamFileGetString exit
1476 25:10655747 | | | | sqloGetNextNodeList data [probe 5]
1477 25:10698043 | | | | sqloGetNextNodeList exit
1478 25:10699779 | | | | sqloPdbTcpIpResolveHostName entry
1479 25:10700201 | | | | | sqloPdbGetHostByName entry
1480 25:10700605 | | | | | sqloPdbGetHostByName data [probe 7]
1810 45:128224793 | | | | | sqloPdbGetHostByName data [probe 25]
1811 45:128226786 | | | | | sqloPdbGetHostByName exit
1812 45:128227114 | | | | sqloPdbTcpIpResolveHostName exit
1813 45:128228218 | | | | sqloPdbTcpIpResolveHostName entry
1814 45:128229057 | | | | | sqloPdbGetHostByName entry
1815 45:128230102 | | | | | sqloPdbGetHostByName data [probe 7]
2259 65:245641277 | | | | | sqloPdbGetHostByName data [probe 25]
2260 65:245643055 | | | | | sqloPdbGetHostByName exit
2261 65:245643514 | | | | sqloPdbTcpIpResolveHostName exit
2262 65:245663318 | | | | sqloGetNextNodeList entry
2263 65:245663841 | | | | | sqloStreamFileGetString entry
2264 65:245664308 | | | | | | sqlovsfh entry
2265 65:245664700 | | | | | | sqlovsfh exit
2266 65:245666685 | | | | | | sqloread entry
2267 65:245671716 | | | | | | sqloread exit [rc = 0x870F0009 = -2029060087 = SQLO_EOF]
2268 65:245672445 | | | | | sqloStreamFileGetString error [probe 15]
2269 65:245672908 | | | | | sqloStreamFileGetString exit [rc = 0x870F0009 = -2029060087 = SQLO_EOF]
2270 65:245673296 | | | | sqloGetNextNodeList exit [rc = 0x870F0009 = -2029060087 = SQLO_EOF]
2271 65:245674354 | | | | sqlopartfmblk entry
2272 65:245676469 | | | | sqlopartfmblk exit
从4秒跳到25秒,然后跳到45秒,最后跳到65秒。那65秒之后呢?笨涅,64秒front end process就出错退出了,还看65秒之后的干什么?
看看这几个大的跳跃:
1238 4:893115316 | | | | | | sqloPdbGetHostByName data [probe 7]
1462 25:10592207 | | | | | | sqloPdbGetHostByName data [probe 25]
1480 25:10700605 | | | | | sqloPdbGetHostByName data [probe 7]
1810 45:128224793 | | | | | sqloPdbGetHostByName data [probe 25]
1815 45:128230102 | | | | | sqloPdbGetHostByName data [probe 7]
2259 65:245641277 | | | | | sqloPdbGetHostByName data [probe 25]
都是同样的function,同样的位置。这个function是干啥的?GetHostByName……不陌生吧,DNS 地址解析嘛!!
下面要做得不用多说了吧,直接找到db2nodes.cfg,看看hostname,然后ping一下看看?用了30秒才返回第一个package!
得了,咱也不研究DNS出什么问题了,反正和db2无关。停了实例,把db2nodes.cfg里面的hostname改成ipaddress,然后connect一次,嘿嘿,搞定了!
[ 本帖最后由 wangzhonnew 于 2007-12-21 03:24 编辑 ] |
|