请教：关于利用数据链路建表死锁的问题？

feng_xin · 发表于 2002-7-11 12:53

最初由 Yong Huang 发布
[B]

That sounds strange. Disconnecting the network is equivalent to killing your sqlplus process without normal exit. The transaction should be rolled back.

Can someone else do a test? Open a sqlplus session to a database on another host, update a table without commit, disconnect the network cable.

If your observation is indeed true, can you tell us more details possibly with some screen text copied and pasted here? I'm interested to know how tcp_keepalive (called tcp_keepalive_interval on Solaris) affects the disrupted SQL session.

Yong Huang [/B]

Disconnecting the network is not equivalent to killing your sqlplus process without normal exit.
当SQLPLUS WITHOUT NORMAL EXIT时，我使用任务管理器结束SQLPLUS任务时，锁很快释放。
当我拔下网线，然后关闭SQLPLUS（normal or not normal)，锁要等到TCP_KEEPIDLE设定的值才释放。
无法做screen copy,因为若拔网线是话前台SCREEN没有任何变化，只能从server端观察LOCK是否释放。我做的具体步骤如下：
A机
1、SQLPLUS SCOTT/TIGER@ALPHA
2、UPDATE EMP SET ENAME='ADF';
3、拔掉网线
4、从TOAD可以看到锁依然存在
B机
1、SQLPLUS SCOTT/TIGER@ALPHA
2、UPDATE EMP SET ENAME=‘DFF’；
3、可以看到屏幕无任何反映，类似死机，任务管理器显示无响应
4、从toad可以看到上个锁依然存在，并又一个锁处于等待状态
当时间达到TCP_KEEPIDLE的值时，锁释放。

Yong Huang · 发表于 2002-7-13 07:06

最初由 feng_xin 发布
[B]Disconnecting the network is not equivalent to killing your sqlplus process without normal exit.
当SQLPLUS WITHOUT NORMAL EXIT时，我使用任务管理器结束SQLPLUS任务时，锁很快释放。
当我拔下网线，然后关闭SQLPLUS（normal or not normal)，锁要等到TCP_KEEPIDLE设定的值才释放。
无法做screen copy,因为若拔网线是话前台SCREEN没有任何变化，只能从server端观察LOCK是否释放。我做的具体步骤如下：
A机
1、SQLPLUS SCOTT/TIGER@ALPHA
2、UPDATE EMP SET ENAME='ADF';
3、拔掉网线
4、从TOAD可以看到锁依然存在
B机
1、SQLPLUS SCOTT/TIGER@ALPHA
2、UPDATE EMP SET ENAME=‘DFF’；
3、可以看到屏幕无任何反映，类似死机，任务管理器显示无响应
4、从toad可以看到上个锁依然存在，并又一个锁处于等待状态
当时间达到TCP_KEEPIDLE的值时，锁释放。 [/B]

I tested this today. Your observation is very likely to be correct. My database is: Oracle9i Enterprise Edition Release 9.0.1.0.0 - Production
OS is: SunOS 5.6 Generic_105181-31 sun4u sparc SUNW,Ultra-2
The client is Windows2000. I updated a table without commit. Unplugged the cable on that PC. Kept checking v$lock. The TM and TX locks existed for approximately 2 hours. And on OS we see:
$ ndd /dev/tcp tcp_keepalive_interval
7200000
which is 2 hours. Next time, I'll modify this parameter to a much shorter time and see how long the lock exists.

BTW, if I kill the session, the lock is immediately gone even though the entry in v$session lingers. Why do you not consider this an option?

I appreciate the message you posted. Before I read it, I only knew that the tcp_keepalive_interval parameter was used by TCP (at the transport level), not sure how any application used it. On Metalink, a search on this string turns out quite a number of articles.

Yong Huang

biti_rainy · 发表于 2002-10-6 10:09

加精华

bodyguard · 发表于 2002-10-7 20:37

可以参考
UNIX网络编程（第1卷）第158页关于SO_KEEPALIVE,TCP_KEEPALIVE部分
抱歉没有电子文档.

bodyguard · 发表于 2002-10-7 20:45

大意是一个TCP连接的两端，在一定时间内没有数据交换，TCP会自动给对方发一个探测分节，已确定对方的状态，并采取相应的措施。
采用修改内核参数的做法，将会影响到主机上所有打开了的套接口。
＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝＝
觉得可以试试tonylinux的办法，修改SQLNET.EXPIRE_TIME.

blue · 发表于 2002-10-8 11:15

oracle使用PMON进程来负责客户端连接中断后的清除工作，如回滚未提交的工作，释放锁等相关资源，此进程是每隔三分钟工作一次，并且这个时间间隔是不可配置的
问题是在于oracle是如何检测到非正常中断的客户端的连接。这个处理与TCP协议有关，因为SQL NET在使用TCP/IP协议进行网络时，当ORACLE连接异常终止时，因为是异常终止，终止信号并没有通过网络通知server端，因此只有下次server与client通信时，server才会发现此client已经断掉.（通过sqlnet.expire_time 的设置？）或者是通过操作系统级的检测，通过TCP_KEEPIDLE等参数（不同的系统平台各异）值的设置可以由操作系统自动检测到非正常中断的连接，从而通知oracle，使用pmon进程释放资源

jiangjp2000 · 发表于 2002-10-8 17:11

UNIXWARE中怎么更改TCP＿KEEPIDLE参数呢？

Yong Huang · 发表于 2002-10-8 21:42

最初由 blue 发布
[B]oracle使用PMON进程来负责客户端连接中断后的清除工作，如回滚未提交的工作，释放锁等相关资源，此进程是每隔三分钟工作一次，并且这个时间间隔是不可配置的
...[/B]

I thought it was every 3 seconds, not minutes. Where did you read this?

Yong Huang

blue · 发表于 2002-10-9 09:19

最初由 Yong Huang 发布
[B]

I thought it was every 3 seconds, not minutes. Where did you read this?

Yong Huang [/B]

metalink
note 53711.996
From: Cate Parry 11-Jan-01 15:49
Subject: PMON cleanup

RDBMS Version: 8.1.7.0.0
Operating System and Version: Sun Solaris 2.7
Error Number (if applicable):
Product (i.e. SQL*Loader, Import, etc.):
Product Version: 8.1.7.0.0

PMON cleanup

Is there an init.ora parameter I can set to have pmon cleanup sessions with a status of KILLED more frequently?

thanx in advance
-Cate

--------------------------------------------------------------------------------

From: Oracle, Melissa Holman 15-Jan-01 21:43
Subject: Re : PMON cleanup

Cate,

PMON is cleaning up and rolling back all uncommitted transactions every 3 minutes and this is not configurable.

You may want to check your parameter setting for CLEANUP_ROLLBACK_ENTRIES. . If you have lots of aborted or killed sessions, a high setting can be a problem as users may queue up behind locks held by the killed sessions as it will take longer to round robin through all sessions.

Refer to this note:

PARAMETER: INIT.ORA: CLEANUP_ROLLBACK_ENTRIES

Whenever an user session is KILLed, the session will remain in v$session marked as killed until the user session attempts something. On an attempt, such as a select, a message that this session is killed will be displayed to the user and then the entry is then removed from v$session.
Depending on what the session was doing at the time it was KILLed, it may take PMON some time to clean it up. After the process is KILLed, it is marked as KILLED and PMON starts the cleanup by rolling back any active transactions and releasing all locks held by that session. Once the transaction has been successfully rolled back all locks and resources held by that session should be released even though the session record may remain in v$session.

The problem is that if the client process no longer exists, there is no way for the message that the session as been killed to be propagated to the client process. Therefore, the session object cannot
be entirely cleaned up and the session record remains in v$session. The way to effect this is to either manually kill the shadow process from the OS. Another alternative is to enable dead connection detection (DCD) within Net8. DCD will periodically poll the client sessions to identify those that may have terminated abnormally and should initiate a cleanup of those that no longer exist.

WHAT IS DEAD CONNECTION DETECTION (DCD)?

Doc ID:  Note:1013364.6
Subject:  What is Dead Connection Detection (DCD)?
Type:  PROBLEM
Status:  PUBLISHED
Content Type:  TEXT/PLAIN
Creation Date:  23-OCT-1995
Last Revision Date:  22-APR-2002

Problem Description:
====================

This document discusses Dead Connection Detection (DCD).  DCD is a feature of
SQL*Net V2.1 and later. It detects when a partner in a SQL*Net V2
client/server or server/server connection has terminated unexpectedly and
releases the resources associated with it.

Search words:
sqlnet expire time
SQLNET.EXPIRE_TIME

DEAD CONNECTION DETECTION
                     =========================

OVERVIEW
--------

  Dead Connection Detection (DCD) is a feature of SQL*Net V2.1 and later. It
detects when a partner in a SQL*Net V2 client/server or server/server
connection has terminated unexpectedly, and releases the resources associated
with it.

  DCD is intended primarily for environments in which clients power down their
systems without disconnecting from their Oracle sessions, a problem
characteristic of networks with PC clients.

  DCD is initiated on the server when a connection is established.  At this
time SQL*Net reads the SQL*Net parameter files and sets a timer to generate an
alarm.  The timer interval is set by providing a non-zero value in minutes for
the SQLNET.EXPIRE_TIME parameter in the sqlnet.ora file.

  When the timer expires, SQL*Net on the server sends a "probe" packet to the
client.  (In the case of a database link, the destination of the link
constitutes the server side of the connection.)  The probe is essentially an
empty SQL*Net packet and does not represent any form of SQL*Net level data,
but it creates data traffic on the underlying protocol.

  If the client end of the connection is still active, the probe is discarded,
and the timer mechanism is reset.  If the client has terminated abnormally,
the server will receive an error from the send call issued for the probe, and
SQL*Net on the server will signal the operating system to release the
connection's resources.

  On Unix servers, the sqlnet.ora file must be in either $TNS_ADMIN or
$ORACLE_HOME/network/admin.  Neither /etc nor /var/opt/oracle alone is valid.

  It should be also be noted that in SQL*Net 2.1.x, an active orphan process
(one processing a query, for example) will not be killed until the query
completes.  In SQL*Net 2.2, orphaned resources will be released regardless of
activity.

  This is a server feature only.  The client may be running any supported
SQL*Net V2 release.

THE FUNCTION OF THE PROTOCOL STACK
----------------------------------

  While Dead Connection Detection is set at the SQL*Net level, it relies
heavily on the underlying protocol stack for it's successful execution.  For
example, you might set SQLNET.EXPIRE_TIME=1 in the sqlnet.ora file, but it is
unlikely that an orphaned server process will be cleaned up immediately upon
expiration of that interval.

  TCP/IP, for example, is a connection-oriented protocol, and as such, the
protocol will implement some level of packet timeout and retransmission in an
effort to guarantee the safe and sequenced order of data packets.  If a timely
acknowledgement is not received in response to the probe packet, the TCP/IP
stack will retransmit the packet some number of times before timing out.
After TCP/IP gives up, then SQL*Net receives notification that the probe
failed.

  The time that it takes TCP/IP to timeout is dependent on the TCP/IP stack,
and timeouts of many minutes are entirely common.  This has been an area of
concern for many customers, as many retransmissions at the protocol layer
causes what could be a significant lag between the expiration of the DCD
interval and the time when the orphaned process is actually killed.

  The easiest way to determine if the protocol stack is causing such a delay
involves testing different DCD intervals.

TESTING THE PROTOCOL STACK
--------------------------

  Set the SQLNET.EXPIRE_TIME parameter to 1 minute and note the time required
to clean up an orphaned server process.  Then set SQLNET.EXPIRE_TIME to 5
minutes and again observe the time required to clean up the shadow.  If the
TCP/IP timeout is the reason the server resources do not get released, the
time to clean up the shadow should increase by about 4 minutes.

  If the TCP/IP retransmission timeout is indeed the problem, the operating
system kernel can be tuned to reduce the interval for and number of packet
retransmissions (on many Unix platforms, the file
/usr/include/netinet/tcp_timer.h contains the configuration parameters).

  Reducing the interval and number of retransmissions may impact other system
components, since in effect you are shrinking the window allowed for
connections to process data, possibly resulting in inadvertent loss of
connections during periods of heavy system load.  Slower connections
from remote sites may be impacted by this change.

  Kernel parameters that may affect retransmission include but are not limited
to TCP_TTL, TCPTV_PERSMIN, TCPTV_MAX, and TCP_LINGERTIME.

***To avoid disrupting other system processes, it is important to contact the
appropriate vendor for assistance in tuning the operating system kernel or
protocol stack.***

MONITORING DEAD CONNECTION DETECTION
------------------------------------

  The best way to determine if DCD is enabled and functioning properly is to
generate a server trace and search the file for the DCD probe packet.  To
generate a server trace, set TRACE_LEVEL_SERVER=16 and
TRACE_DIRECTORY_SERVER=<path> in sqlnet.ora on the server (note the
location of the sqlnet.ora file).  The resulting trace file will have a
filename of svr_<PID>.trc and will be located in the specified directory.

Is DCD Enabled?

Search the server trace file for an entry like the following:

      osntns: Enabling dead connection detection (1 min)

The timer interval listed should match the value of SQLNET.EXPIRE_TIME.

Is DCD Working?

Search the server trace file for DCD probe packets.  They will appear
in the form of empty data packets, as follows:

      nstimexp: entry
      nstimexp: timer expired at 05-OCT-95 12:15:05
      nsdo: entry
      nsdo: cid=0, opcode=67, *bl=0, *what=1, uflgs=0x2, cflgs=0x3
      nsdo: nsctx: state=8, flg=0x621c, mvd=0
      nsdo: gtn=93, gtc=93, ptn=10, ptc=2048
      nsdoacts: entry
      nsdofls: entry
      nsdofls: DATA flags: 0x0
      nsdofls: sending NSPTDA packet
      nspsend: entry
      nspsend: plen=10, type=6
      nttwr: entry
      nttwr: socket 4 had bytes written=10
      nttwr: exit
      nspsend: 10 bytes to transport
      nspsend

acket dump
      nspsend:00 0A 00 00 06 00 00 00  |........|
      nspsend:00 00 00 00 00 00 00 00  |........|
      nspsend: normal exit
      nsdofls: exit (0)
      nsdoacts: flushing transport
      nttctl: entry
      nsdoacts: normal exit
      nsdo: normal exit
      nstimexp: normal exit

The entry:

      nspsend:00 0A 00 00 06 00 00 00  |........|
      nspsend:00 00 00 00 00 00 00 00  |........|

represents the probe packet.  Note that DCD packets are 10 bytes long when
they are issued to the protocol stack.  Once the protocol header and trailer
bytes for the underlying protocols have been added, the packet could be
approximately 70 bytes long.

  If DCD is enabled, you will see these probe packets written to the trace
file when the timer expires.  If the server is a UNIX system, it might be
useful to establish a connection and tail the trace file:

      tail -f svr_<PID>.trc

The time elapsed after each probe packet is written to the server trace should
match the SQLNET.EXPIRE_TIME value.

KNOWN PROBLEMS OR LIMITATIONS
-----------------------------

  - Of the few reported problems, perhaps the most significant is DCD's poor
performance on Windows NT.  Dead connections are cleaned up only when the
server is rebooted and the database is restarted.  Exactly how well DCD works
on NT depends on the client's protocol implementation. SQL*Net v2.3 has
improved the performance over earlier releases.

  This has been logged as port-specific Bug#303578.

  - On SCO Unix, a problem was reported in which server processes spin,
consuming large amounts of CPU, once the DCD timer expires.  The problem is
due to improper signal handling and can be eliminated by disabling DCD.

  This is port-specific Bug#293264.

  - Orphaned resources are not released if only the client application is
terminated.  Only after the client PC has been rebooted does DCD release these
resources.  For example, if a Windows application is killed yet Windows
remains running, the probe packet may be received and discarded as if the
connection is still active.  As it currently stands, it appears that DCD
detects dead client machines, but not dead client processes.

  This is logged as generic Bug#280848.


  - The SQL*Net V2 implementation on MVS does not use the generic DCD
mechanism, and therefore the SQLNET.EXPIRE_TIME parameter does not apply.  The
KEEPALIVE function of IBM's TCP/IP is used instead.  This was implemented
prior to the development of DCD.

  This is documented in port-specific Bug#301318.

  - DCD relies heavily on issuing probe packets during any phase of the
connection.  This is not be possible with some protocols which run
half-duplex.  Hence, DCD is not enabled on protocols like APPC/LU6.2.

  This is not a bug, but is rather the intended design.

- Local connections using BEQ protocol adapters are not supported with DCD.
Local connections using the IPC protocol adapters are supported with DCD.

如果设置sqlnet.expire_time的话，应该和操作系统级的tcp/ip通讯机制有关
引自Cluster:
To speed up cleanup of killed sessions you can increase the value of the
CLEANUP_ROLLBACK_ENTRIES in init.ora parameter.
PMON processes CLEANUP_ROLLBACK_ENTRIES blocks every 3 minutes.
Therefore, we can calculate when all the blocks will have been cleared.
For example if you have 2753 undo blocks and CLEANUP_ROLLBACK_ENTRIES
equal to 20. It will take approximately
2753 blocks / 20 blocks per batch = 138 batches
138 batches * 3 minutes per batch = 414 minutes = 6 hours 54 minutes
to go through all the undo blocks.
不过此参数在8i里已经没有了

spark · 发表于 2002-10-9 09:46

ORACLE8（oracle8052）中，我记得有一个BUG，就是在过程中使用DDL语句，容易造成SESSION HANG。因为我曾经在过程中使用动态SQL创建表、视图和同义词，很简单的查询语句，经常执行老半天没有返回。后来我把动态语句拿出来放到脚本中执行就没有问题（通过参数传递办法）。

[精华] 请教：关于利用数据链路建表死锁的问题？

顶

有谁在SCOUNIXWARE7.1.1中解决过这个问题。

讨论好热烈。我遇过SESSION没有响应的情况，不过不是因为网络。