|
问题:
客户有一个aix box,其中两个instance。
最近他们将两个instance 从v7 fp9升级到v8 fp11。
其中一个instance 没有问题。
另一个instance有3个数据库,其中一个数据库在作migrate db的时候报告sql10004c错误:
SQL10004C An I/O error occurred while accessing the database
directory.
分析:
1)了解系统信息
2)得到必要的诊断文件。
因为instance的其他database都可以成功migrate,因此问题估计只局限于这一个database
在db2diag.log里有如下信息:
2006-05-16-13.39.22.962416-240 I153574C271 LEVEL: Severe
PID : 1294402 TID : 772 PROC : db2hmon 0
INSTANCE: deviadm NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqledogd, probe:100
MESSAGE : Start db directory migration
2006-05-16-13.39.22.962732-240 I153846C338 LEVEL: Severe
PID : 1294402 TID : 772 PROC : db2hmon 0
INSTANCE: deviadm NODE : 000
FUNCTION: DB2 UDB, base sys utilities, sqledogd, probe:140
MESSAGE : dirtype =
DATA #1 : Hexdump, 1 bytes
0x3074F010 : 76 v
2006-05-16-13.39.23.046264-240 E154185C329 LEVEL: Warning (OS)
PID : 1294402 TID : 772 PROC : db2hmon 0
INSTANCE: deviadm NODE : 000
FUNCTION: DB2 UDB, oper system services, sqlochgfileptr, probe:100
CALLED : OS, -, unspecified_system_function
OSERR : EBADF (9) "Bad file number"
2006-05-16-13.39.23.046737-240 E154515C329 LEVEL: Warning (OS)
PID : 1294402 TID : 772 PROC : db2hmon 0
INSTANCE: deviadm NODE : 000
FUNCTION: DB2 UDB, oper system services, sqlochgfileptr, probe:100
CALLED : OS, -, unspecified_system_function
OSERR : EBADF (9) "Bad file number"
从这里我们屁也看不出来,除了好像是什么EBADF有点用。
从/usr/include/errno.h我们看到#define EBADF 9 /* Bad file descriptor */
因此初步设想,文件错误……
客户重新回到v7 instance restore database,然后再次升级,错误依旧……
看来并非偶然情况,值得分析……
从上面的db2diag.log我们看到错误是文件引起的,但具体什么文件偶们就8知道了,这样就只能让客户作db2trc
db2trc on -l 10M
db2trc clr
<reproduce the problem>
db2trc dmp db2trc.dmp
db2trc off
db2trc flw db2trc.dmp db2trc.flw
db2trc fmt db2trc.dmp db2trc.fmt
拿到db2trc以后在里面找sqlochgfileptr, 一下子就看到了:
4372 | | | | | | | | | | | sqlegred entry
4373 | | | | | | | | | | | | sqlochgfileptr entry
4374 | | | | | | | | | | | | | pdLogSysRC entry
4375 | | | | | | | | | | | | | | _pdlogInt entry
4376 | | | | | | | | | | | | | | | sqltGetDiagPath entry
4377 | | | | | | | | | | | | | | | sqltGetDiagPath data [probe 10]
4378 | | | | | | | | | | | | | | | sqltGetDiagPath exit
4379 | | | | | | | | | | | | | | _pdlogInt data [probe 130]
4380 | | | | | | | | | | | | | | | sqloGetInstanceOwnerName entry
4381 | | | | | | | | | | | | | | | sqloGetInstanceOwnerName exit
4382 | | | | | | | | | | | | | | | sqloGetName entry
4383 | | | | | | | | | | | | | | | sqloGetName exit
4384 | | | | | | | | | | | | | | | sqltgets entry
4385 | | | | | | | | | | | | | | | sqltgets exit
4386 | | | | | | | | | | | | | | _pdlogInt data [probe 110]
4387 | | | | | | | | | | | | | | | sqloopenp entry
4388 | | | | | | | | | | | | | | | sqloopenp data [probe 30]
4389 | | | | | | | | | | | | | | | | sqloflock entry
4390 | | | | | | | | | | | | | | | | sqloflock data [probe 1]
4391 | | | | | | | | | | | | | | | | sqloflock exit
4392 | | | | | | | | | | | | | | | sqloopenp exit [rc = 0x070F00CB = 118423755 = SQLO_FREV]
4393 | | | | | | | | | | | | | | | sqlochgfileptr entry
4394 | | | | | | | | | | | | | | | sqlochgfileptr exit
4395 | | | | | | | | | | | | | | | sqlowrite entry
4396 | | | | | | | | | | | | | | | sqlowrite exit
4397 | | | | | | | | | | | | sqlochgfileptr SYSTEM ERROR [probe 100]
4398 | | | | | | | | | | | | | | | sqloclose entry
4399 | | | | | | | | | | | | | | | sqloclose exit
4400 | | | | | | | | | | | | | | _pdlogInt exit
4401 | | | | | | | | | | | | | pdLogSysRC exit
4402 | | | | | | | | | | | | sqlochgfileptr exit [rc = 0x860F000A = -2045837302 = SQLO_FNEX]
4403 | | | | | | | | | | | sqlegred exit [rc = 0xFFFFD8EC = -10004]
但是光这些我们看不到程序逻辑,再往上翻页:
4207 | | | | | | | | | sqlederd entry
4208 | | | | | | | | | | sqledogd entry[/COLOR]
4209 | | | | | | | | | | | sqlfcsys entry
4210 | | | | | | | | | | | | sqlf_read_dbm_and_verif entry
4211 | | | | | | | | | | | | | sqlfpath entry
4212 | | | | | | | | | | | | | | sqledflt entry
4213 | | | | | | | | | | | | | | | sqloInstancePath entry
4214 | | | | | | | | | | | | | | | sqloInstancePath exit
4215 | | | | | | | | | | | | | | sqledflt exit
4216 | | | | | | | | | | | | | sqlfpath exit
4217 | | | | | | | | | | | | sqlf_read_dbm_and_verif data [probe 4]
4218 | | | | | | | | | | | | | sqloopenp entry
4219 | | | | | | | | | | | | | sqloopenp data [probe 30]
4220 | | | | | | | | | | | | | | sqloflock entry
4221 | | | | | | | | | | | | | | sqloflock data [probe 1]
4222 | | | | | | | | | | | | | | sqloflock exit
4223 | | | | | | | | | | | | | sqloopenp exit
4224 | | | | | | | | | | | | | sqloread entry
4225 | | | | | | | | | | | | | sqloread exit
4226 | | | | | | | | | | | | | sqlfcsum entry
4227 | | | | | | | | | | | | | sqlfcsum exit [rc = 0x0000529C = 21148]
4228 | | | | | | | | | | | | sqlf_read_dbm_and_verif exit
4229 | | | | | | | | | | | | sqloclose entry
4230 | | | | | | | | | | | | sqloclose exit
4231 | | | | | | | | | | | sqlfcsys exit
4232 | | | | | | | | | | | sqlegopn entry
4233 | | | | | | | | | | | | sqloopenp entry
4234 | | | | | | | | | | | | sqloopenp data [probe 30]
4235 | | | | | | | | | | | | | sqloflock entry
4236 | | | | | | | | | | | | | sqloflock data [probe 1]
4237 | | | | | | | | | | | | | sqloflock exit
4238 | | | | | | | | | | | | sqloopenp exit
4239 | | | | | | | | | | | sqlegopn exit
4240 | | | | | | | | | | | sqlegred entry
4241 | | | | | | | | | | | | sqlochgfileptr entry
4242 | | | | | | | | | | | | sqlochgfileptr exit
4243 | | | | | | | | | | | | sqloread entry
4244 | | | | | | | | | | | | sqloread exit
4245 | | | | | | | | | | | sqlegred exit
4246 | | | | | | | | | | | sqloclose entry
4247 | | | | | | | | | | | sqloclose exit
4248 | | | | | | | | | | | sqlt_logerr_str entry
4249 | | | | | | | | | | | sqlt_logerr_str data [probe 10]
4250 | | | | | | | | | | | | pdLog entry
4251 | | | | | | | | | | | | | _pdlogInt entry
4252 | | | | | | | | | | | | | | sqltGetDiagPath entry
4253 | | | | | | | | | | | | | | sqltGetDiagPath data [probe 10]
4254 | | | | | | | | | | | | | | sqltGetDiagPath exit
4255 | | | | | | | | | | | | | _pdlogInt data [probe 130]
4256 | | | | | | | | | | | | | | sqloGetInstanceOwnerName entry
4257 | | | | | | | | | | | | | | sqloGetInstanceOwnerName exit
4258 | | | | | | | | | | | | | | sqloGetName entry
4259 | | | | | | | | | | | | | | sqloGetName exit
4260 | | | | | | | | | | | | | | sqltgets entry
4261 | | | | | | | | | | | | | | sqltgets exit
4262 | | | | | | | | | | | | | _pdlogInt data [probe 110]
4263 | | | | | | | | | | | | | | sqloopenp entry
4264 | | | | | | | | | | | | | | sqloopenp data [probe 30]
4265 | | | | | | | | | | | | | | | sqloflock entry
4266 | | | | | | | | | | | | | | | sqloflock data [probe 1]
4267 | | | | | | | | | | | | | | | sqloflock exit
4268 | | | | | | | | | | | | | | sqloopenp exit [rc = 0x070F00CB = 118423755 = SQLO_FREV]
4269 | | | | | | | | | | | | | | sqlochgfileptr entry
4270 | | | | | | | | | | | | | | sqlochgfileptr exit
4271 | | | | | | | | | | | | | | sqlowrite entry
4272 | | | | | | | | | | | | | | sqlowrite exit
4273 | | | | | | | | | | sqledogd data [probe 100]
4274 | | | | | | | | | | | | | | sqloclose entry
4275 | | | | | | | | | | | | | | sqloclose exit
4276 | | | | | | | | | | | | | _pdlogInt exit
4277 | | | | | | | | | | | | pdLog exit
4278 | | | | | | | | | | | sqlt_logerr_str exit
4279 | | | | | | | | | | | sqledflt entry
4280 | | | | | | | | | | | | sqloInstancePath entry
4281 | | | | | | | | | | | | sqloInstancePath exit
4282 | | | | | | | | | | | sqledflt exit
4283 | | | | | | | | | | | sqledflt entry
4284 | | | | | | | | | | | | sqloInstancePath entry
4285 | | | | | | | | | | | | sqloInstancePath exit
4286 | | | | | | | | | | | sqledflt exit
4287 | | | | | | | | | | | sqloGetInstanceOwnerName entry
4288 | | | | | | | | | | | sqloGetInstanceOwnerName exit
4289 | | | | | | | | | | | sqlt_logerr_str entry
4290 | | | | | | | | | | | sqlt_logerr_str data [probe 10]
4291 | | | | | | | | | | | sqlt_logerr_str data [probe 20]
4292 | | | | | | | | | | | | pdLog entry
4293 | | | | | | | | | | | | | _pdlogInt entry
4294 | | | | | | | | | | | | | | sqltGetDiagPath entry
4295 | | | | | | | | | | | | | | sqltGetDiagPath data [probe 10]
4296 | | | | | | | | | | | | | | sqltGetDiagPath exit
4297 | | | | | | | | | | | | | _pdlogInt data [probe 130]
4298 | | | | | | | | | | | | | | sqloGetInstanceOwnerName entry
4299 | | | | | | | | | | | | | | sqloGetInstanceOwnerName exit
4300 | | | | | | | | | | | | | | sqloGetName entry
4301 | | | | | | | | | | | | | | sqloGetName exit
4302 | | | | | | | | | | | | | | sqltgets entry
4303 | | | | | | | | | | | | | | sqltgets exit
4304 | | | | | | | | | | | | | _pdlogInt data [probe 110]
4305 | | | | | | | | | | | | | | sqloopenp entry
4306 | | | | | | | | | | | | | | sqloopenp data [probe 30]
4307 | | | | | | | | | | | | | | | sqloflock entry
4308 | | | | | | | | | | | | | | | sqloflock data [probe 1]
4309 | | | | | | | | | | | | | | | sqloflock exit
4310 | | | | | | | | | | | | | | sqloopenp exit [rc = 0x070F00CB = 118423755 = SQLO_FREV]
4311 | | | | | | | | | | | | | | sqlochgfileptr entry
4312 | | | | | | | | | | | | | | sqlochgfileptr exit
4313 | | | | | | | | | | | | | _pdlogInt data [probe 90]
4316 | | | | | | | | | | | | | | sqlowrite entry
4317 | | | | | | | | | | | | | | sqlowrite exit
4318 | | | | | | | | | | sqledogd data [probe 140][/COLOR]
4319 | | | | | | | | | | | | | | sqloclose entry
4320 | | | | | | | | | | | | | | sqloclose exit
4321 | | | | | | | | | | | | | _pdlogInt exit
4322 | | | | | | | | | | | | pdLog exit
4323 | | | | | | | | | | | sqlt_logerr_str exit
4324 | | | | | | | | | | | sqloexec2 entry[/COLOR]
4327 | | | | | | | | | | | | sqloexecs entry
4332 | | | | | | | | | | | | sqloexecs exit
4333 | | | | | | | | | | | | sqlowchd entry
4355 | | | | | | | | | | | | sqlowchd data [probe 20]
4356 | | | | | | | | | | | | sqlowchd exit
4357 | | | | | | | | | | | sqloexec2 exit[/COLOR]
4358 | | | | | | | | | | | sqlegopn entry[/COLOR]
4359 | | | | | | | | | | | | sqloopenp entry
4360 | | | | | | | | | | | | sqloopenp data [probe 30]
4361 | | | | | | | | | | | | | sqloflock entry
4362 | | | | | | | | | | | | | sqloflock data [probe 1]
4363 | | | | | | | | | | | | | sqloflock exit
4364 | | | | | | | | | | | | sqloopenp exit
4365 | | | | | | | | | | | sqlegopn exit
4366 | | | | | | | | | | | sqlegred entry[/COLOR]
4367 | | | | | | | | | | | | sqlochgfileptr entry
4368 | | | | | | | | | | | | sqlochgfileptr exit
4369 | | | | | | | | | | | | sqloread entry
4370 | | | | | | | | | | | | sqloread exit
4371 | | | | | | | | | | | sqlegred exit
4372 | | | | | | | | | | | sqlegred entry
4373 | | | | | | | | | | | | sqlochgfileptr entry[/COLOR]
4374 | | | | | | | | | | | | | pdLogSysRC entry
4375 | | | | | | | | | | | | | | _pdlogInt entry
4376 | | | | | | | | | | | | | | | sqltGetDiagPath entry
4377 | | | | | | | | | | | | | | | sqltGetDiagPath data [probe 10]
4378 | | | | | | | | | | | | | | | sqltGetDiagPath exit
4379 | | | | | | | | | | | | | | _pdlogInt data [probe 130]
4380 | | | | | | | | | | | | | | | sqloGetInstanceOwnerName entry
4381 | | | | | | | | | | | | | | | sqloGetInstanceOwnerName exit
4382 | | | | | | | | | | | | | | | sqloGetName entry
4383 | | | | | | | | | | | | | | | sqloGetName exit
4384 | | | | | | | | | | | | | | | sqltgets entry
4385 | | | | | | | | | | | | | | | sqltgets exit
4386 | | | | | | | | | | | | | | _pdlogInt data [probe 110]
4387 | | | | | | | | | | | | | | | sqloopenp entry
4388 | | | | | | | | | | | | | | | sqloopenp data [probe 30]
4389 | | | | | | | | | | | | | | | | sqloflock entry
4390 | | | | | | | | | | | | | | | | sqloflock data [probe 1]
4391 | | | | | | | | | | | | | | | | sqloflock exit
4392 | | | | | | | | | | | | | | | sqloopenp exit [rc = 0x070F00CB = 118423755 = SQLO_FREV]
4393 | | | | | | | | | | | | | | | sqlochgfileptr entry
4394 | | | | | | | | | | | | | | | sqlochgfileptr exit
4395 | | | | | | | | | | | | | | | sqlowrite entry
4396 | | | | | | | | | | | | | | | sqlowrite exit
4397 | | | | | | | | | | | | sqlochgfileptr SYSTEM ERROR [probe 100]
4398 | | | | | | | | | | | | | | | sqloclose entry
4399 | | | | | | | | | | | | | | | sqloclose exit
4400 | | | | | | | | | | | | | | _pdlogInt exit
4401 | | | | | | | | | | | | | pdLogSysRC exit
4402 | | | | | | | | | | | | sqlochgfileptr exit [rc = 0x860F000A = -2045837302 = SQLO_FNEX]
4403 | | | | | | | | | | | sqlegred exit [rc = 0xFFFFD8EC = -10004][/COLOR]
这些红色部分的东西勉强能看出当时的逻辑……
在sqledogd 中作了一堆东西,然后到了sqledogd data [probe 140]以后进入sqloexec2 作真正的migration,然后打开一个什么文件,读取两次.
看起来在第二次读取得时候出现了这个问题.
然后在fmt文件中找到对应的行号:
4359 entry DB2 UDB oper system services sqloopenp cei (1.3.15.811.2)
pid 1130532 tid 1 cpid 85 node 0
bytes 124
Data1 (PD_TYPE_FILE_NAME,92) File name:
/udb/deviadm/unitpsp/container0/deviadm/NODE0000/SQL00001/deviadm/NODE0000/sqldbdir/sqldbdir
Data2 (PD_TYPE_UINT,4) unsigned integer:
852
Data3 (PD_TYPE_UINT,4) unsigned integer:
436
4379 data DB2 UDB RAS/PD component _pdlogInt fnc (3.3.134.7.0.130)
pid 1130532 tid 1 cpid 85 node 0 probe 130
bytes 108
Data1 (PD_TYPE_BITMASK,8) Bitmask:
0x0000000000000002
Data2 (PD_TYPE_ECFID,4) Function:
DB2 UDB, oper system services, sqlochgfileptr
Data3 (PD_TYPE_HEXINT,8) Hex integer:
0x0000000000000000
Data4 (PD_TYPE_ECFID,4) Function:
OS, -, unspecified_system_function
Data5 (PD_TYPE_HEXINT,8) Hex integer:
0x0000000000000009
Data6 (PD_TYPE_UINT,4) unsigned integer:
100
Data7 (PD_TYPE_UINT,4) unsigned integer:
3
Data8 (PD_TYPE_UINT,4) unsigned integer:
4
4397 SYSTEM ERROR DB2 UDB oper system services sqlochgfileptr cei (5.3.15.659.2.100)
pid 1130532 tid 1 cpid 85 node 0 probe 100
bytes 376
Data1 (PD_TYPE_DIAG_LOG_REC,360) Diagnostic log record:
2006-05-16-13.45.55.141789-240 E158120C359 LEVEL: Warning (OS)
PID : 1130532 TID : 1 PROC : db2bp
INSTANCE: deviadm NODE : 000
FUNCTION: DB2 UDB, oper system services, sqlochgfileptr, probe:100
CALLED : OS, -, unspecified_system_function
OSERR : EBADF (9) "A file descriptor does not refer to an open file."
这里我们就可以看出来db2在打开sqldbdir文件以后读取得时候出了问题。
因此向客户索要sqldbdir文件,发现在0x13位上是04,也就是说这个文件并没有被migrate。
现在问题就比较明朗了,有两种可能性
a)数据库没有被migrate
b)数据库已经migrate但是sqldbdir文件出了问题。
但是我们怎么检查到底数据库有没有被migrate好呢?
我们知道,如果做db2 list db directory的话db2寻找的是.../sqllib/sqldbdir/sqldbdir文件,但如果是db2 list db directory on xxxx的话db2就会在其目录下的sqldbdir文件中检索。
每次检索都会搜寻其中的数据库是否需要migreate。
因此让客户db2 list db directory on /udb/deviadm/unitpsp/container0/deviadm/NODE0000/SQL00001/
然后拿到生成的db2imdbdir文件,再让客户用db2trc格式化得到:
pid = 2359296 tid = 0 node = -1
1 sqlemdbd entry
2 sqlemdbd data [probe 5]
3 sqlemdbd data [probe 5]
4 sqlemdbd data [probe 5]
5 sqlemdbd data [probe 5]
6 | sqlemgdr entry
7 | sqlemgdr data [probe 4]
8 | sqlemgdr data [probe 5]
9 | sqlemgdr data [probe 6]
10 | sqlemgdr data [probe 7]
11 | sqlemgdr data [probe 8]
12 | | sqleMigOpenWrapper entry
13 | | sqleMigOpenWrapper data [probe 758]
14 | | sqleMigOpenWrapper data [probe 758]
15 | | sqleMigOpenWrapper data [probe 758]
16 | | sqleMigOpenWrapper data [probe 758]
17 | | | sqleMigCheckError entry
18 | | | sqleMigCheckError data [probe 758]
19 | | | sqleMigCheckError data [probe 758]
20 | | | sqleMigCheckError exit [rc = 2]
21 | | sqleMigOpenWrapper data [probe 758]
22 | | sqleMigOpenWrapper exit [rc = 1]
23 | | dir_check entry
24 | | | sqleMigOpenWrapper entry
25 | | | sqleMigOpenWrapper data [probe 4263]
26 | | | sqleMigOpenWrapper data [probe 4263]
27 | | | sqleMigOpenWrapper data [probe 4263]
28 | | | sqleMigOpenWrapper data [probe 4263]
29 | | | | sqleMigCheckError entry
30 | | | | sqleMigCheckError data [probe 4263]
31 | | | | sqleMigCheckError data [probe 4263]
32 | | | | sqleMigCheckError exit [rc = 2]
33 | | | sqleMigOpenWrapper data [probe 4263]
34 | | | sqleMigOpenWrapper exit [rc = 1]
35 | | | sqleMigOpenWrapper entry
36 | | | sqleMigOpenWrapper data [probe 4319]
37 | | | sqleMigOpenWrapper data [probe 4319]
38 | | | sqleMigOpenWrapper data [probe 4319]
39 | | | sqleMigOpenWrapper data [probe 4319]
40 | | | sqleMigOpenWrapper data [probe 4319]
41 | | | sqleMigOpenWrapper exit
42 | | | sqleMigCloseWrapper entry
43 | | | sqleMigCloseWrapper data [probe 4365]
44 | | | sqleMigCloseWrapper exit
45 | | dir_check exit
46 | | release_check entry
47 | | | sqleMigOpenWrapper entry
48 | | | sqleMigOpenWrapper data [probe 4528]
49 | | | sqleMigOpenWrapper data [probe 4528]
50 | | | sqleMigOpenWrapper data [probe 4528]
51 | | | sqleMigOpenWrapper data [probe 4528]
52 | | | sqleMigOpenWrapper data [probe 4528]
53 | | | sqleMigOpenWrapper exit
54 | | | sqleMigCloseWrapper entry
55 | | | sqleMigCloseWrapper data [probe 4839]
56 | | | sqleMigCloseWrapper exit
57 | | release_check exit
58 | sqlemgdr error [probe 70]
59 | sqlemgdr exit [rc = 0x000003E8 = 1000]
60 sqlemdbd exit [rc = 3]
嘿嘿,一看返回值是1000,于是说明了数据库已经被migrate好了,那么唯一的可能就是sqldbdir文件有错误。
自己创建一个目录,然后把路径做好,得到客户的四个分区下的sqldbdir,然后在testing enviornment上做db2 list db directory onxxxx,这样db2就会自动尝试升级这个文件。
奇怪的是在测试环境下运行一切ok,0x13位上也从0x04变成了0x09,将4个文件发回给客户替换原来的文件,db2stop db2start db2 connect成功……
结论:
不知道为什么在客户环境中打开并读取两次sqldbdir文件的时候报告EBADF,尤其是打开没问题,第一次读取可以,但是在第二次读取前的sqlochgfileptr 函数下出错。
sqlochgfileptr 实际上只是计算了offset后进行fseek的,因此原本我们怀疑客户有mix code(就是在一个instance下各个lib文件所属于的db2 release不同),因此:
1) which db2imdbd
/home/deviadm/sqllib/bin/db2imdbd
2) dump -H sqllib/adm/db2sysc |tee db2sysc.dump
3) dump -H sqllib/bin/db2imdbd |tee db2imdbd.dump
4) mount |tee mount.out
5) dump -H sqllib/lib/libdb2e.a |tee libdb2e.dump
但是从结果看起来一切正常,这样的话root cause还是不清楚……
不过database看起来运行一切都好,客户也就不追究了…… |
|