|
最近分析trace文件比较有心得,再发一篇不需要检查源程序就可以发现问题的trace跟踪。
问题描述:
客户的db2运行在cluster上,刚开始他们出现了内存错误,停掉db2后就无法启动了,想要切换到standby上也不成功。
以下是db2diag.log:
2006-05-26-15.47.59.574000-240 I1H914 LEVEL: Event
PID : 2224 TID : 2160 PROC : db2start.exe
INSTANCE: DB2 NODE : 000
FUNCTION: DB2 UDB, RAS/PD component, _pdlogInt, probe:120
START : New db2diag.log file
DATA #1 : Build Level, 124 bytes
Instance "DB2" uses "32" bits and DB2 code release "SQL08024"
with level identifier "03050106".
Informational tokens are "DB2 v8.1.11.973", "s060120", "WR21365", FixPak "11".
DATA #2 : System Info, 1564 bytes
System: WIN32_NT IMG-LIB1 5.2 x86 Family 15, model 2, stepping 6
CPU: total:8 online:8 Cores per socket:1 Threading degree per core:2
Physical Memory(MB): total:5632 free:5074 available:3012
Virtual Memory(MB): total:17504 free:22209
Swap Memory(MB): total:11872 free:17135
Information in this record is only valid at the time when this file was
created (see this record's time stamp)
2006-05-26-15.47.59.574000-240 I918H1537 LEVEL: Event
PID : 2224 TID : 2160 PROC : db2start.exe
INSTANCE: DB2 NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleStartStopSingleNode, probe:1130
DATA #1 : String, 39 bytes
C:\PROGRA~1\IBM\SQLLIB\bin\DB2STAR2.EXE
DATA #2 : Hexdump, 256 bytes
0x0012A8B8 : 433A 5C50 524F 4752 417E 315C 4942 4D5C C:\PROGRA~1\IBM\
0x0012A8C8 : 5351 4C4C 4942 5C62 696E 5C44 4232 5354 SQLLIB\bin\DB2ST
0x0012A8D8 : 4152 322E 4558 4500 4E4F 4D53 4700 0000 AR2.EXE.NOMSG...
0x0012A8E8 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0012A8F8 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0012A908 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0012A918 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0012A928 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0012A938 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0012A948 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0012A958 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0012A968 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0012A978 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0012A988 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0012A998 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0012A9A8 : 0000 0000 0000 0000 0000 0000 0000 0000 ................
2006-05-26-15.48.00.092000-240 E2457H501 LEVEL: Warning
PID : 3992 TID : 2416 PROC : db2syscs.exe
INSTANCE: DB2 NODE : 000
FUNCTION: DB2 UDB, SQO Memory Management, sqlogmshr3, probe:45
DATA #1 : <preformatted>
Set 2 size truncated from 983040000 bytes to 200015872 bytes. To avoid this truncation, either reduce your memory configuration parameters for this set or update the DB2NTMEMSIZE registry variable with the string APLD:983040000.
2006-05-26-15.48.01.222000-240 I2960H357 LEVEL: Error
PID : 3992 TID : 2416 PROC : db2syscs.exe
INSTANCE: DB2 NODE : 000
FUNCTION: DB2 UDB, bsu security, sqlex_read_aud_cfg, probe:20
RETCODE : ZRC=0x860F000A=-2045837302=SQLO_FNEX "File not found."
DIA8411C A file "" could not be found.
2006-05-26-15.48.01.222000-240 I3319H355 LEVEL: Severe
PID : 3992 TID : 2416 PROC : db2syscs.exe
INSTANCE: DB2 NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleInitSysCtlr, probe:98
MESSAGE : DiagData
DATA #1 : Hexdump, 4 bytes
0x0123FB84 : D5FA FFFF ....
2006-05-26-15.48.01.222000-240 I3676H350 LEVEL: Severe
PID : 3992 TID : 2416 PROC : db2syscs.exe
INSTANCE: DB2 NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleSysCtlr, probe:5
MESSAGE : DiagData
DATA #1 : Hexdump, 4 bytes
0x0123FBA8 : D5FA FFFF ....
2006-05-26-15.48.01.394000-240 I4028H500 LEVEL: Error
PID : 2224 TID : 2160 PROC : db2start.exe
INSTANCE: DB2 NODE : 000
FUNCTION: DB2 UDB, oper system services, sqloStartDB2Service, probe:50
DATA #1 : Hexdump, 47 bytes
0x00129BA4 : 4442 322D 3020 3A20 5468 6520 7072 6F63 DB2-0 : The proc
0x00129BB4 : 6573 7320 7465 726D 696E 6174 6564 2075 ess terminated u
0x00129BC4 : 6E65 7870 6563 7465 646C 792E 0D0A 0A nexpectedly....
2006-05-26-15.48.01.394000-240 I4530H523 LEVEL: Severe
PID : 2224 TID : 2160 PROC : db2start.exe
INSTANCE: DB2 NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleStartStopSingleNode, probe:70
MESSAGE : DiagData
DATA #1 : Hexdump, 47 bytes
0x0012A278 : 4442 322D 3020 3A20 5468 6520 7072 6F63 DB2-0 : The proc
0x0012A288 : 6573 7320 7465 726D 696E 6174 6564 2075 ess terminated u
0x0012A298 : 6E65 7870 6563 7465 646C 792E 0D0A 0A nexpectedly....
2006-05-26-15.48.01.410000-240 I5055H357 LEVEL: Error
PID : 2224 TID : 2160 PROC : db2start.exe
INSTANCE: DB2 NODE : 000
FUNCTION: DB2 UDB, bsu security, sqlex_read_aud_cfg, probe:20
RETCODE : ZRC=0x860F000A=-2045837302=SQLO_FNEX "File not found."
DIA8411C A file "" could not be found.
2006-05-26-15.48.01.410000-240 I5414H305 LEVEL: Error
PID : 2224 TID : 2160 PROC : db2start.exe
INSTANCE: DB2 NODE : 000
MESSAGE : Audit error. sqlcode is:
DATA #1 : Hexdump, 4 bytes
0x0012E774 : EEFB FFFF ....
在这里我们可以看到第一个error是:
2006-05-26-15.48.01.222000-240 I2960H357 LEVEL: Error
PID : 3992 TID : 2416 PROC : db2syscs.exe
INSTANCE: DB2 NODE : 000
FUNCTION: DB2 UDB, bsu security, sqlex_read_aud_cfg, probe:20
RETCODE : ZRC=0x860F000A=-2045837302=SQLO_FNEX "File not found."
DIA8411C A file "" could not be found.
db2diag -rc 0x860F000A后我们得到:
Input ZRC string '0x860F000A' parsed as 0x860F000A (-2045837302).
ZRC value to map: 0x860F000A (-2045837302)
V7 Equivalent ZRC value: 0xFFFFE60A (-6646)
ZRC class :
Critical Media Error (Class Index: 6)
Component:
SQLO ; oper system services (Component Index: 15)
Reason Code:
10 (0x000A)
Identifer:
SQLO_FNEX
SQLO_MOD_NOT_FOUND
Identifer (without component):
SQLZ_RC_FNEX
Description:
File not found.
Associated information:
Sqlcode -980
SQL0980C A disk error occurred. Subsequent SQL statements cannot be
processed.
Number of sqlca tokens : 0
Diaglog message number: 8411
然后又有:2006-05-26-15.48.01.222000-240 I3319H355 LEVEL: Severe
PID : 3992 TID : 2416 PROC : db2syscs.exe
INSTANCE: DB2 NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqleInitSysCtlr, probe:98
MESSAGE : DiagData
DATA #1 : Hexdump, 4 bytes
0x0123FB84 : D5FA FFFF ....
db2diag -rc 0xfffffad5后:
Input ECF string '0xfffffad5' parsed as 0xFFFFFAD5 (-1323).
NOTE: ../sqz/sqlzwhatisrc.C:
V7 input ZRC 0xFFFFFAD5 (-1323) may translate to V8 ZRC value of 0x83000
AD5 (-2097149227)
ZRC value to map: 0x83000AD5 (-2097149227)
V7 Equivalent ZRC value: 0xFFFFBAD5 (-17707)
ZRC class :
Unexpected Operating System error (Class Index: 3)
Component:
Unknown component (Component Index: 0)
Undefined as of DB2 v8.1.10.812; s050811
Attempting to lookup value 0xFFFFFAD5 (-1323) as a sqlcode
Sqlcode -1323
SQL1323N An error occurred when accessing the audit configuration file.
看起来好像是SQL1323的错误,也就是db2audit.cfg找不到。
但是由于这个是production system,我们不能胡乱给一个猜测,一定要有confidence,对不对?而且当时客户感觉对于db2了解有限,我们不能单单告诉他是db2audit的问题,要给他们一个详细的解决方案。因而我们需要db2trace验证我们的猜测:
老套路:
db2trc on -l 10M
db2trc clr
<db2start> <---- failed here
db2trc dmp
xxxxxx
(trace文件包含在附件中)
好,下面该开始正式分析了。
先找到db2start的入口:
1830 DB2StartMain entry
找到入口处我们在找错误退出的部分:
16272 DB2StartMain exit [rc = 0xFFFFFAD5 = -1323]
确实是由于1323的错误退出。
向回跟踪,慢慢找,要有耐心:)
终于又找到一处错误点:
16094 DB2StartMain error [probe 710]
可以肯定错误是在这之前了。
继续向回跟踪:
5200 | | | EnvPrfGetValueByEnumIndex entry
5201 | | | EnvPrfGetValueByEnumIndex exit [rc = 0x870F0104 = -2029059836 = RC_ENV_NOT_FOUND]
5202 | | sqloGetEnvInternal exit [rc = 0x870F0104 = -2029059836 = RC_ENV_NOT_FOUND]
5203 | sqleSetupEnvVars exit
5204 | sqloReportServiceStatus entry
5205 | sqloReportServiceStatus exit [rc = 1]
5206 | sqloWaitIPCWaitPost entry
5207 | | sqloWaitInterrupt entry
13455 | | sqloWaitInterrupt exit [rc = 0x870F00B9 = -2029059911 = SQLO_SEM_TIMEOUT]
13456 | sqloWaitIPCWaitPost exit [rc = 0x870F00B9 = -2029059911 = SQLO_SEM_TIMEOUT]
13457 | sqloReportServiceStatus entry
13458 | sqloReportServiceStatus exit [rc = 1]
13459 | sqloWaitIPCWaitPost entry
13460 | | sqloWaitInterrupt entry
16084 | | sqloWaitInterrupt exit
16086 | sqloWaitIPCWaitPost exit
16087 | sqloReportServiceStatus entry
16093 | sqloReportServiceStatus exit [rc = 1]
我们看到程序从5207一下子就跳到了13455,并且是在sqlWaitInterrupt入口后紧接着返回SQLO_SEM_TIMEOUT。
先记住这里,bookmark上,然后再向回走,看了半天好像没有什么可疑的,那么这里的嫌疑最大。
对照fmt文件:
我们先去10694看一看:
16094 error DB2 UDB base sys utilities DB2StartMain fnc (4.3.5.304.0.710)
pid 972 tid 2300 cpid -1 node 0 sec 26 nsec 113255673 probe 710
Error ZRC = 0x00000000 = 0 = PSM_OK
bytes 12
Data1 (PD_TYPE_DEFAULT,4) Hexdump:
D5FA FFFF ....
嘿嘿,0xFFFFFAD5就是-1323,然后回头看5207和13455没有发现其他的信息,不过别放弃。
由于5207后程序转过去执行其他的进程/线程,因而产生了跳跃。所以我们应该从5208开始继续跟踪:
pid = 972 tid = 288 node = 0
5208 sqloGetEnvInternal entry
一直下拉下拉别怕麻烦:
16110 | DB2main exit
终于到了这一段进程的结束,从这里向回跟踪:
经过了无数可以忽略的IPC Wait我们眼睛一亮,突然看到了-1323这个字眼:
14824 | | | | sqleInitSysCtlr exit [rc = 0xFFFFFAD5 = -1323]
估计错误就在5208和14824之间了,继续向回跟踪:
在上面几行我们看到了:
14779 | | | | | | sqlex_build_sqlca exit [rc = 0xFFFFFAD5 = -1323]
14780 | | | | | sqlex_set_audit_state data [probe 10]
14781 | | | | | sqlex_set_audit_state exit [rc = 0xFFFFFAD5 = -1323]
你看SQLCA结构中竟然有了1323,可想而知错误就在附近,集中精力:
14646 | | | | | | sqlex_read_aud_cfg entry
14647 | | | | | | | sqlex_gen_aud_dir entry
14653 | | | | | | | | sqleinstancepath entry
14654 | | | | | | | | | sqloInstancePath entry
14655 | | | | | | | | | sqloInstancePath exit
14656 | | | | | | | | sqleinstancepath exit
14657 | | | | | | | sqlex_gen_aud_dir exit
14664 | | | | | | | sqloopenp entry
14692 | | | | | | | sqloopenp exit [rc = 0x860F000A = -2045837302 = SQLO_FNEX]
14693 | | | | | | sqlex_read_aud_cfg data [probe 20]
14694 | | | | | | | sqlt_logerr_zrc entry
14695 | | | | | | | sqlt_logerr_zrc data [probe 10]
14696 | | | | | | | | sqlzGetZRCInfo entry
14697 | | | | | | | | sqlzGetZRCInfo exit [rc = 0xFFFFFC2C = -980]
14698 | | | | | | | | pdLogRC entry
14699 | | | | | | | | | _pdlogInt entry
14700 | | | | | | | | | | sqltGetDiagPath entry
14701 | | | | | | | | | | sqltGetDiagPath data [probe 10]
14702 | | | | | | | | | | sqltGetDiagPath exit
14703 | | | | | | | | | _pdlogInt data [probe 130]
14704 | | | | | | | | | | sqloGetInstanceOwnerName entry
14705 | | | | | | | | | | sqloGetInstanceOwnerName exit
14706 | | | | | | | | | | sqloprocname entry
14707 | | | | | | | | | | sqloprocname data [probe 5]
14708 | | | | | | | | | | sqloprocname exit
14709 | | | | | | | | | | sqltgets entry
14710 | | | | | | | | | | sqltgets exit
14711 | | | | | | | | | _pdlogInt data [probe 110]
14712 | | | | | | | | | | sqloopenp entry
14713 | | | | | | | | | | | sqloDetermineFsCachingPolicy entry
14714 | | | | | | | | | | | sqloDetermineFsCachingPolicy exit
14715 | | | | | | | | | | sqloopenp data [probe 20]
14716 | | | | | | | | | | | sqloflock entry
14717 | | | | | | | | | | | sqloflock data [probe 1]
14718 | | | | | | | | | | | sqloflock exit
14719 | | | | | | | | | | sqloopenp data [probe 140]
14720 | | | | | | | | | | sqloopenp exit [rc = 0x070F00CB = 118423755 = SQLO_FREV]
14721 | | | | | | | | | | sqlochgfileptr entry
14722 | | | | | | | | | | sqlochgfileptr data [probe 1]
14723 | | | | | | | | | | sqlochgfileptr exit
14724 | | | | | | | | | _pdlogInt data [probe 100]
14725 | | | | | | | | | | sqlzGetZRCInfo entry
14726 | | | | | | | | | | sqlzGetZRCInfo exit [rc = 0xFFFFFC2C = -980]
14727 | | | | | | | | | | sqlogmsg entry
14728 | | | | | | | | | | | sqlnlsGetRegVars entry
14729 | | | | | | | | | | | | sqloGetEnvInternal entry
14730 | | | | | | | | | | | | sqloGetEnvInternal exit [rc = 0x870F0104 = -2029059836 = RC_ENV_NOT_FOUND]
14731 | | | | | | | | | | | | sqloGetEnvInternal entry
14732 | | | | | | | | | | | | sqloGetEnvInternal exit [rc = 0x870F0104 = -2029059836 = RC_ENV_NOT_FOUND]
14733 | | | | | | | | | | | | sqloGetEnvInternal entry
14734 | | | | | | | | | | | | sqloGetEnvInternal exit [rc = 0x870F0104 = -2029059836 = RC_ENV_NOT_FOUND]
14735 | | | | | | | | | | | | sqloGetEnvInternal entry
14736 | | | | | | | | | | | | sqloGetEnvInternal exit [rc = 0x870F0104 = -2029059836 = RC_ENV_NOT_FOUND]
14737 | | | | | | | | | | | | sqloGetEnvInternal entry
14738 | | | | | | | | | | | | sqloGetEnvInternal exit [rc = 0x870F0104 = -2029059836 = RC_ENV_NOT_FOUND]
14739 | | | | | | | | | | | | sqloGetEnvInternal entry
14740 | | | | | | | | | | | | sqloGetEnvInternal exit [rc = 0x870F0104 = -2029059836 = RC_ENV_NOT_FOUND]
14741 | | | | | | | | | | | | sqloGetEnvInternal entry
14742 | | | | | | | | | | | | sqloGetEnvInternal exit [rc = 0x870F0104 = -2029059836 = RC_ENV_NOT_FOUND]
14743 | | | | | | | | | | | sqlnlsGetRegVars data [probe 90]
14744 | | | | | | | | | | | sqlnlsGetRegVars exit
14745 | | | | | | | | | | | sqlnlsgmsg entry
14746 | | | | | | | | | | | sqlnlsgmsg data [probe 1]
14747 | | | | | | | | | | | sqlnlsgmsg data [probe 2]
14748 | | | | | | | | | | | | sqlnlsgmsg_cpcv entry
14749 | | | | | | | | | | | | sqlnlsgmsg_cpcv data [probe 10]
14750 | | | | | | | | | | | | | sqlnlscmsg entry
14751 | | | | | | | | | | | | | sqlnlscmsg data [probe 10]
14752 | | | | | | | | | | | | | sqlnlscmsg data [probe 20]
14753 | | | | | | | | | | | | | | sqlnlsmapcc entry
14754 | | | | | | | | | | | | | | sqlnlsmapcc data [probe 1]
14755 | | | | | | | | | | | | | | sqlnlsmapcc data [probe 2]
14756 | | | | | | | | | | | | | | sqlnlsmapcc exit
14757 | | | | | | | | | | | | | sqlnlscmsg data [probe 30]
14758 | | | | | | | | | | | | | sqlnlscmsg data [probe 60]
14759 | | | | | | | | | | | | | sqlnlscmsg data [probe 80]
14760 | | | | | | | | | | | | | sqlnlscmsg exit
14761 | | | | | | | | | | | | sqlnlsgmsg_cpcv data [probe 200]
14762 | | | | | | | | | | | | sqlnlsgmsg_cpcv exit [rc = 0x00000028 = 40]
14763 | | | | | | | | | | | sqlnlsgmsg exit [rc = 0x00000028 = 40]
14764 | | | | | | | | | | sqlogmsg exit [rc = 0x00000028 = 40]
14765 | | | | | | | | | | sqlowrite entry
14766 | | | | | | | | | | sqlowrite exit
14767 | | | | | | sqlex_read_aud_cfg data [probe 20]
14768 | | | | | | | | | | sqloclose entry
14769 | | | | | | | | | | sqloclose data [probe 10]
14770 | | | | | | | | | | sqloclose exit
14771 | | | | | | | | | _pdlogInt exit
14772 | | | | | | | | pdLogRC exit
14773 | | | | | | | sqlt_logerr_zrc exit
14774 | | | | | | sqlex_read_aud_cfg exit [rc = 0x40000021 = 1073741857]
从这里我们看到了sqlex_read_aud_cfg 的入口和出口
其中:
14664 | | | | | | | sqloopenp entry
14692 | | | | | | | sqloopenp exit [rc = 0x860F000A = -2045837302 = SQLO_FNEX]
明显就是1323的问题所在,因为0x860F000A可以map到
Input ZRC string '0x860F000A' parsed as 0x860F000A (-2045837302).
ZRC value to map: 0x860F000A (-2045837302)
V7 Equivalent ZRC value: 0xFFFFE60A (-6646)
ZRC class :
Critical Media Error (Class Index: 6)
Component:
SQLO ; oper system services (Component Index: 15)
Reason Code:
10 (0x000A)
Identifer:
SQLO_FNEX
SQLO_MOD_NOT_FOUND
Identifer (without component):
SQLZ_RC_FNEX
Description:
File not found.
Associated information:
Sqlcode -980
SQLO_MOD_NOT_FOUND显然就是文件没有找到。
然后去fmt文件,从14693开始向上寻找:
14664 entry DB2 UDB oper system services sqloopenp cei (1.3.15.811.2)
pid 972 tid 288 cpid -1 node 0 sec 26 nsec 14971560
bytes 79
Data1 (PD_TYPE_FILE_NAME,47) File name:
\\LSDB_HA\DB2MSCS-DB2\DB2\security\db2audit.cfg
Data2 (PD_TYPE_UINT,4) unsigned integer:
852
Data3 (PD_TYPE_SINT,4) signed integer:
1536
14692 exit DB2 UDB oper system services sqloopenp cei (2.3.15.811.2)
pid 972 tid 288 cpid -1 node 0 sec 26 nsec 16839017
rc = 0x860F000A = -2045837302 = SQLO_FNEX
14693 data DB2 UDB bsu security sqlex_read_aud_cfg fnc (3.3.92.95.0.20)
pid 972 tid 288 cpid -1 node 0 sec 26 nsec 16843210 probe 20
bytes 12
Data1 (PD_TYPE_DEFAULT,4) Hexdump:
0A00 0F86 ....
可以看到了在打开文件\\LSDB_HA\DB2MSCS-DB2\DB2\security\db2audit.cfg的时候发生了错误。这里大家要注意14664下面紧接着14692,不要被fmt之中的14665到14691之间的内容迷惑。(这是因为OS在等待IO,所以切换到另一个进程去执行其他的东西)
因而现在我们对问题就有一个大概的印象了。
刚开始db2start,然后进入一个Interrupt Waiting,然后去另一个进程/线程开始执行DB2Main,在这之间当系统读取db2audit.cfg文件的时候发生了某些错误,然后系统返回1323错误,退出了Interrupt Waiting,然后db2start退出。
当知道了这些solution离我们就不远了。
我们可以更加深入地从硬件或者OS方面入手分析为什么发生错误,也可以直接重新创建一个好的db2audit.cfg。
不过学计算机的全是懒家伙,既然有了方便的方法我们自然不去找为什么它会出错了,直接db2audit configure reset:
http://publib.boulder.ibm.com/in ... 1%75%64%69%74%22%20
然后db2start一次成功。
结论:
db2trace是个非常强大的工具(前提是你要会看那些和天书一样的行号和function name),用来解决可以重复出现/再现的异常相当简洁明了(最好有明确的error code,如果没有error code,或者是性能调整问题db2trace一点用处也没有,并且发生错误时所作的工作不要过大,比如说load就很难用trace调试,我可不希望花N个小时去读10万行以上或者几十兆的flw文件)。与其花10个小时猜测调试各种可能性不如花2个小时坐下来踏踏实实地读trace。 |
|