Testing buffered I/O and direct I/O

jungleduan · posted 2005-8-20 11:10

only if you have Oracle sharing your system w/ other stuff (like NFS...).
Of course you need some footprint for the OS.

-jg

<==
Actually, saying "always" may not be appropriate. If the application and the database do not follow some I/O pattern and the usage is very unpredictable, using filesystem cache with a fairly large buffer cache may be needed. That way, when Oracle fails to maintain blocks in memory that need to be in memory for some reason, the filesystem comes to the rescue.

I know this is debatable. So testing in each individual case is encouraged.

jungleduan · posted 2005-8-20 11:25

==> I am not sure about the implementation on Solaris. The discussion below is about Linux. Correct me if anything is wrong.

1. AIO is supported by default on Solaris platforms. If you use direct I/O, the system will run unstably.

==> AIO and DIO can be used at the same time. In fact, they should be used together. Buffered I/O leads to less stable behavior because you depend on the OS to manage memory for you (e.g. another application flushes the cache buffers, which causes a lot of paging).

2. You can run dd under buffered I/O and direct I/O, then run iostat -xntc 3 to check the I/O stats. The bs in the dd command can be 8k (the same as db_block_size) or 2M.

3. You can also use mkfile, then run iostat -xntc 3.

4. You can also use iometer, a tool provided by SUN and available at www.sun.com, to check I/O stats under buffered I/O and direct I/O.
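The dd-style comparison in item 2 can be roughed out in a few lines of Python. This is only a sketch of the buffered side: `timed_read` and `compare_block_sizes` are hypothetical helper names, and the reads go through the OS page cache, so a second pass over the same file mostly measures the cache rather than the disks.

```python
import os
import time


def timed_read(path, block_size):
    """Read `path` sequentially in `block_size` chunks through the OS page cache.

    Returns (bytes_read, elapsed_seconds). This is ordinary buffered I/O;
    it does not bypass the filesystem cache.
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 disables Python's own buffer, so each read() asks the OS
    # for up to block_size bytes, much like dd with bs=block_size.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def compare_block_sizes(path, block_sizes=(8 * 1024, 2 * 1024 * 1024)):
    """Time one pass per block size; 8K and 2M mirror the bs values suggested above."""
    results = {}
    for bs in block_sizes:
        results[bs] = timed_read(path, bs)
    return results
```

To get numbers that mean anything, point it at a file much larger than RAM (or one that has not been read since mount) and watch iostat in another terminal while it runs.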

==> iostat on Linux can only monitor physical I/O, not logical I/O. Which metric to look at depends on your I/O pattern. If you do a lot of full table scans and hash joins, bandwidth is the key: check how many KB/sec you reach. The default buffered I/O is good here because the OS reads ahead on disk for you. However, I still prefer direct I/O combined with changing the Oracle multiblock read parameter. At the same time, AIO will help if you have a RAID config (>5%, but make sure you don't have a bottleneck on either the SCSI/FC bus or the number of disks).

If you do a lot of random I/O (OLTP), turning off the OS buffer will help (or at least turn off the OS read-ahead feature). Look at the await column in iostat, which is the real latency you get at the OS level; it should be less than 10ms. Turning on AIO can also help a lot here. You can even consider vector I/O (the 2.6 kernel has good support, but I am not sure about Oracle). If you have a large write cache on your RAID controller, whether to turn it on or off depends on your I/O workload. You should definitely dedicate a bay to the logs with the cache turned on; for data, maybe or maybe not.

cc59 · posted 2005-8-20 18:39

Originally posted by biti_rainy
[B]

What eyesight...

EMC SANs -------- not EMS SCANs [/B]

:) Oops, I mistyped it.

玉面飞龙 · posted 2005-8-23 05:09

Thanks for your experienced words.

I also got Tom's reply listed below.

Followup:
I would set up a load test and just use something as simple as statspack to
measure the before and after.

Most of the time you are not waiting on direct IO; dbwr and lgwr are -- you
don't care if they wait UNLESS you wait for them, so you would be looking for
things like "less log file syncs" (that is when WE wait on LGWR), less free
buffer waits (us waiting on dbwr) and so on.

Statspack with its throughput numbers would be sufficient I would believe.

Don't forget, you might see a nose dive in performance as you REMOVE that double
cache that you have been relying on - you may well have to increase your buffer
cache to accommodate.

I'll post results after test,...

玉面飞龙 · posted 2005-9-3 16:08

db_block_size=8k. When an I/O request is larger than 256K, the system uses direct I/O; otherwise (including exactly 256K) it uses buffered I/O.
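The arithmetic behind that rule is simple enough to write down. A minimal sketch, assuming the 256K cutoff stated above and that a multiblock read is issued as one request of db_block_size × db_file_multiblock_read_count bytes; `expected_io_mode` is a hypothetical helper, not anything Oracle or the filesystem provides.

```python
# 256K cutoff as quoted in the post; treat it as an assumption about this system.
DIRECT_IO_THRESHOLD = 256 * 1024  # bytes


def request_size(db_block_size, multiblock_read_count):
    """Size in bytes of one multiblock read request."""
    return db_block_size * multiblock_read_count


def expected_io_mode(db_block_size, multiblock_read_count,
                     threshold=DIRECT_IO_THRESHOLD):
    """Return 'direct' for requests strictly larger than the threshold, else 'buffered'."""
    size = request_size(db_block_size, multiblock_read_count)
    return "direct" if size > threshold else "buffered"


for mbrc in (32, 33, 34):
    size_kb = request_size(8192, mbrc) // 1024
    print(mbrc, size_kb, "KB", expected_io_mode(8192, mbrc))
```

With an 8K block size, 32 blocks is exactly 256K and stays buffered, while 33 or more crosses the cutoff.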

Test: a parallel full table scan, with every run made while the system was very quiet.

[php]
alter session set db_file_multiblock_read_count = 32;

select /*+ full(agdx) parallel(agdx,16) */
count(*)
from  very_big_table;

-- With db_file_multiblock_read_count = 32 (32 x 8K = 256K), buffered I/O is used

alter session set db_file_multiblock_read_count = 34;

select /*+ full(agdx) parallel(agdx,16) */
count(*)
from  very_big_table;

-- With db_file_multiblock_read_count = 34 (34 x 8K = 272K), direct I/O is used

[/php]

Results from set timing and vmstat 1; this run shows buffered I/O

[php]
SQL> alter session set db_file_multiblock_read_count=32;

Session altered.

Elapsed: 00:00:00.00
SQL> select /*+ full(agdx) parallel(agdx,16) */
  2  count(*)
  3  from  very_big_table;

  COUNT(*)
----------
259900404

Elapsed: 00:01:15.96

         extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t  %w  %b device
2.1 4.3 25.7 41.4  0.4  0.2 61.4 27.7 1 2 c0t0d0
0.0 0.1 0.5 0.4  0.0  0.0 0.0 6.9 0 0 c0t1d0
0.1 0.1 2.1 1.3  0.0  0.0 0.0 7.4 0 0 c0t2d0
2.1 4.3 25.6 41.4  0.4  0.2 62.1 28.3 1 3 c1t0d0
0.0 0.1 0.5 0.4  0.0  0.0 0.0 6.7 0 0 c1t1d0
0.1 0.1 2.1 1.3  0.0  0.0 0.0 8.1 0 0 c1t2d0

vmstat output (excerpt)
procs    memory          page          disk       faults    cpu
r b w swap  free  re  mf pi po fr de sr s0 s1 s2 s1 in sy cs us sy id
15 0 0 12687848 1426488 16408 732 217240 0 0 0 0 48 0 0 22 58688 3971 4131 35 65 0
15 0 0 12687936 1426416 18395 5 188440 0 0 0 0 0 0 0 0 49980 4215 4271 38 62 0
15 3 0 12687392 1426112 15702 97 178576 0 0 0 0 3 0 0 11 58599 3904 4878 37 63 0
17 2 0 12689456 1427520 16141 49 210240 0 0 0 0 0 0 0 0 57191 3442 4455 35 65 0
14 0 0 12689728 1426864 18131 7 200960 0 0 0 0 2 0 0 2 49523 3411 3791 37 63 0
14 0 0 12688688 1424272 17532 901 168792 0 0 0 0 69 2 2 72 45783 5235 3942 39 61 0
12 0 0 12689768 1426208 17907 399 160952 8 8 0 0 14 0 0 33 49245 5859 4495 39 61 0
12 1 0 12689488 1424136 19003 20 164152 0 0 0 0 9 0 0 14 50383 3262 3753 39 61 0
13 0 0 12689536 1426112 18855 176 176264 0 0 0 0 25 0 0 9 47460 3334 3622 39 61 0
20 0 0 12690000 1426456 19031 240 172976 8 8 0 0 18 0 0 4 48491 5048 4265 42 58 0
9 0 0 12689976 1425552 19450 426 175472 0 0 0 0 0 0 0 3 44481 3767 3653 41 59 0
17 0 0 12689080 1425544 18737 935 175232 0 0 0 0 20 0 0 2 49451 5216 4172 41 59 0
15 0 0 12688920 1424352 18156 292 172488 0 0 0 0 1 0 0 0 43646 5299 3769 45 55 0
22 0 0 12686984 1424136 17354 1387 161688 0 0 0 0 9 0 0 1 47260 6242 4439 43 57 0
12 0 0 12687208 1423280 18860 614 157104 8 8 0 0 2 0 0 3 45905 5597 4241 39 61 0
16 2 0 12687104 1424072 16667 360 373176 0 0 0 0 9 0 0 6 57433 3926 4504 36 64 0
20 0 0 12686184 1423040 18003 1346 168096 0 0 0 0 6 0 0 13 48196 9704 4106 39 61 0
17 2 0 12686648 1422680 18830 60 172992 0 0 0 0 5 0 0 0 48359 4000 4000 38 62 0
13 0 0 12687056 1425216 19135 207 168440 0 0 0 0 4 0 0 0 48452 4820 4298 43 57 0
procs    memory          page          disk       faults    cpu
r b w swap  free  re  mf pi po fr de sr s0 s1 s2 s1 in sy cs us sy id
18 0 0 12686760 1424440 19282 221 175280 0 0 0 0 2 0 0 0 48446 3558 4018 38 62 0
15 0 0 12686568 1424472 18504 1088 176512 0 0 0 0 2 0 0 2 48605 4815 4154 40 60 0
18 0 0 12685768 1422912 18954 674 171912 0 0 0 0 13 0 0 20 48754 5390 3818 40 60 0
24 0 0 12687088 1423400 17250 1194 157744 0 0 0 0 2 0 0 7 46669 8501 4538 42 58 0
17 1 0 12688120 1424184 17369 376 161448 0 0 0 0 3 0 0 23 48498 6249 4169 41 59 0
17 0 0 12687344 1424344 17495 356 179648 0 0 0 0 8 0 0 23 54474 4478 4047 40 60 0

[/php]

Results from set timing and vmstat 1; the run below shows direct I/O

[php]

SQL>  alter session set db_file_multiblock_read_count=34;

Session altered.

Elapsed: 00:00:00.00
SQL>  select /*+ full(agdx) parallel(agdx,16) */
count(*)
from  very_big_table;

  COUNT(*)
----------
259900404

Elapsed: 00:00:54.79

                  extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t  %w  %b device
2.1 4.3 25.7 41.4  0.4  0.2 61.4 27.7 1 2 c0t0d0
0.0 0.1 0.5 0.4  0.0  0.0 0.0 6.9 0 0 c0t1d0
0.1 0.1 2.1 1.3  0.0  0.0 0.0 7.4 0 0 c0t2d0
2.1 4.3 25.6 41.4  0.4  0.2 62.1 28.3 1 3 c1t0d0
0.0 0.1 0.5 0.4  0.0  0.0 0.0 6.7 0 0 c1t1d0
0.1 0.1 2.1 1.3  0.0  0.0 0.0 8.1 0 0 c1t2d0

r b w swap  free  re  mf pi po fr de sr s0 s1 s2 s1 in sy cs us sy id
13 26 0 12657800 1426216 1680 0 18928 0 0 0 0 0 0 0 0 68574 8547 7821 62 38 0
11 29 0 12614632 1426192 2122 133 16224 0 0 0 0 1 0 0 1 65301 14980 7351 63 37 0
13 19 0 12658312 1426128 2402 205 19784 0 0 0 0 1 0 0 0 71806 12629 9751 57 43 0
10 30 0 12658216 1425808 1737 0 18184 0 0 0 0 0 0 0 0 74793 8105 8696 59 41 0
0 36 0 12657464 1426392 1541 0 12672 0 0 0 0 0 0 0  0 69195 7321 7482 63 32 5
0 42 0 12656464 1426264 1406 0 12784 0 0 0 0 0 0 0  0 68656 8158 8447 61 31 8
2 33 0 12658680 1426736 1251 0 11312 0 0 0 0 0 0 0  0 56435 6311 7152 55 24 22
3 27 0 12658296 1426656 1608 682 10416 0 0 0 0 0 0 0 0 66396 7769 6461 65 31 4
1 28 0 12656528 1425440 1444 686 12840 0 0 0 0 0 0 0 0 65131 8849 7865 60 33 7
0 35 0 12654952 1425440 1439 8 11560 0 0 0 0 1 0 0  1 73899 9344 9804 63 37 1
2 35 0 12656056 1425760 1516 0 12120 0 0 0 0 1 0 0  1 63881 6965 6250 59 32 8
0 43 0 12654584 1426056 1577 2 12328 0 0 0 0 0 0 0  0 68301 9007 7416 67 32 1
9 32 0 12657160 1426112 1441 16 12472 0 0 0 0 0 0 0 0 72630 13614 10008 57 41 2
0 30 0 12656448 1425240 1553 690 12392 0 0 0 0 0 0 0 0 62027 8782 6376 58 29 13
5 36 0 12656384 1425448 1465 8 12008 0 0 0 0 0 0 0  0 60118 7926 6151 57 24 19
0 48 0 12656424 1426048 1435 25 12144 0 0 0 0 1 0 0 1 64530 8394 7592 59 33 9
0 27 0 12657904 1427696 1556 0 11144 0 0 0 0 0 0 0  0 67054 7195 6558 63 31 6
2 31 0 12658760 1427928 1583 0 12928 0 0 0 0 0 0 0  0 64035 6790 6230 61 28 11
23 22 0 12659552 1427992 1450 0 12904 0 0 0 0 0 0 0 0 75471 9773 10424 59 37 4
procs    memory          page          disk       faults    cpu
r b w swap  free  re  mf pi po fr de sr s0 s1 s2 s1 in sy cs us sy id
6 26 0 12659744 1428064 1561 0 24504 0 0 0 0 0 0 0  0 70949 8786 7954 69 31 0
10 21 0 12659192 1427872 1680 0 13800 0 0 0 0 0 0 0 0 71520 7637 6837 69 31 0
0 45 0 12664592 1434752 1642 4 12752 0 0 0 0 1 0 0  1 70941 8146 7467 64 34 2
3 27 0 12664568 1434672 1592 310 7920 8 0 0 0 14 0 0 2 74117 13064 9851 62 38 1
17 24 0 12665224 1435056 1643 0 2512 0 0 0 0 1 0 0  1 74086 11899 9574 62 38 0
5 28 0 12665192 1435216 1657 0 368 0 0 0 0 0  0  0  0 72510 9082 8565 67 32 0
17 19 0 12663912 1433776 1587 1368 392 0 0 0 0 0 0 0 0 66726 12517 9651 67 33 0
9 29 0 12662072 1433600 2126 2130 624 0 0 0 0 20 0 0 15 64447 10678 6498 68 32 0
0 45 0 12662456 1434064 1463 8 352 0 0 0 0 8  0  1  6 60408 7256 6058 61 33 7
11 24 0 12662832 1434144 1042 0 408 0 0 0 0 36 2 0 38 70932 9134 9311 63 34 3
12 18 0 12662608 1433560 222 724 512 0 0 0 0 0 0 0  0 66063 13460 10770 63 37 0

[/php]

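Because our system runs a lot of unoptimized SQL, the I/O-related wait events db file scattered read and db file sequential read are both very high. I had been worried that, without the OS file cache, sequential reads under direct I/O would be slower than under buffered I/O.

Judging from the timing and vmstat results above:

1) Disk sequential reads are somewhat better under direct I/O than under buffered I/O: about 55s vs 76s.

2) Under direct I/O, us is much higher than sy; under buffered I/O, sy is higher than us. Good or bad?

3) in (interrupts), sy (system calls) and cs (context switches) are all higher under direct I/O than under buffered I/O. Good or bad?

4) Under buffered I/O, the vmstat CPU run queue is consistently high while the I/O queue is low;
  under direct I/O it is the reverse: the CPU run queue is moderate but the I/O wait queue is very high.

5) The iostat results are about the same.

6) Page-ins (pi) are clearly high under buffered I/O.

I'm not clear on how a combination of in, sy, cs, us, sy and id indicates an improvement in system performance.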
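One way to make these comparisons less eyeball-driven is to average the vmstat columns for each run. The sketch below is an assumption-laden helper, not a standard tool: it relies on the 22-field Solaris vmstat layout shown above, addresses columns by position because the names s1 and sy repeat, and uses only the first three rows of each run as sample data.

```python
# Positions in the 22-field vmstat rows shown above (names repeat, so use indexes).
COLUMNS = {"r": 0, "b": 1, "pi": 7, "in": 16, "syscalls": 17,
           "cs": 18, "us": 19, "sys": 20, "id": 21}


def parse_rows(text):
    """Keep only all-numeric 22-field lines; header lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 22 and all(f.lstrip("-").isdigit() for f in fields):
            rows.append([int(f) for f in fields])
    return rows


def column_means(rows):
    """Average each column of interest across the sampled rows."""
    return {name: sum(r[i] for r in rows) / len(rows)
            for name, i in COLUMNS.items()}


BUFFERED = """\
r b w swap  free  re  mf pi po fr de sr s0 s1 s2 s1 in sy cs us sy id
15 0 0 12687848 1426488 16408 732 217240 0 0 0 0 48 0 0 22 58688 3971 4131 35 65 0
15 0 0 12687936 1426416 18395 5 188440 0 0 0 0 0 0 0 0 49980 4215 4271 38 62 0
15 3 0 12687392 1426112 15702 97 178576 0 0 0 0 3 0 0 11 58599 3904 4878 37 63 0
"""

DIRECT = """\
r b w swap  free  re  mf pi po fr de sr s0 s1 s2 s1 in sy cs us sy id
13 26 0 12657800 1426216 1680 0 18928 0 0 0 0 0 0 0 0 68574 8547 7821 62 38 0
11 29 0 12614632 1426192 2122 133 16224 0 0 0 0 1 0 0 1 65301 14980 7351 63 37 0
13 19 0 12658312 1426128 2402 205 19784 0 0 0 0 1 0 0 0 71806 12629 9751 57 43 0
"""

for label, text in (("buffered", BUFFERED), ("direct", DIRECT)):
    print(label, column_means(parse_rows(text)))
```

Feeding in the full captures instead of three rows gives steadier averages. The means only describe where the time goes; they do not say which mode is better.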

biti_rainy · posted 2005-9-3 16:30

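Buffered I/O involves managing the file cache, so it is reasonable that it consumes more system CPU. When the system burns more CPU, I/O speed may drop as well.

Buffered I/O provides read-ahead, but it is bad for writes. It consumes memory, and the system's management overhead increases.
If memory is large enough for the buffer to cache the files, reads may improve.

So what matters is which option finishes the same overall workload in less total time, not which system metric looks a little better along the way.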
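Taking total elapsed time as the yardstick, the two SQL*Plus timings reported earlier in the thread (00:01:15.96 buffered, 00:00:54.79 direct) can be compared directly. A small sketch; `elapsed_to_seconds` and `compare` are hypothetical helper names.

```python
def elapsed_to_seconds(text):
    """Convert a SQL*Plus 'HH:MM:SS.ss' elapsed value to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def compare(before, after):
    """Summarize how much faster `after` is than `before`."""
    b = elapsed_to_seconds(before)
    a = elapsed_to_seconds(after)
    return {
        "before_s": b,
        "after_s": a,
        "speedup": b / a,                        # >1 means `after` is faster
        "time_saved_pct": 100.0 * (b - a) / b,   # share of the original time saved
    }


result = compare("00:01:15.96", "00:00:54.79")
print(result)
```

This is one query run once in each mode, so it says little about the whole workload; a load test measured end to end is the comparison that counts.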

玉面飞龙 · posted 2005-9-3 16:42

I'm uploading a direct I/O document. Who wrote it? I forget.

When should you start thinking about it in the first place? The most significant indication is significant paging activity and a large percentage of CPU time spent in kernel mode.

Buffered I/O provides read-ahead, but it is bad for writes. It consumes memory, and the system's management overhead increases.

The dd results show this clearly.

[php]

Direct IO mode

Server>time dd if=pradb.dbf of=/data/oracle/sprrprd1/data23/steven_zhang.dmp bs=264k
7451+1 records in
7451+1 records out

real 0m45.67s
user 0m0.04s
sys    0m10.47s

DIRECT IO>vmstat 1
procs    memory          page          disk       faults    cpu
r b w swap  free  re  mf pi po fr de sr s0 s1 s2 s1 in sy cs us sy id
5 1 0 15582472 3587520 0 0  0  0  0  0  0  6  0  0  6 4294967196 0 0 -119 -95 -192
0 1 0 12702488 1712368 0 14 16 0  0  0  0  0  0  0  0 21184 2697 1758 1  4 94
0 1 0 12702488 1712368 0 0  0  0  0  0  0  0  0  0  0 26486 688 1465  0  6 94
0 1 0 12702264 1712368 0 18 0  0  0  0  0  1  0  0  1 24945 2567 2217 2  5 93
0 1 0 12702472 1712352 0 0 128 0  0  0  0  0  0  0  0 26258 1661 2012 1  7 92
0 0 0 12701456 1711584 51 708 16 0 0 0  0  0  0  0  0 24429 4111 2514 4  9 87
0 1 0 12702472 1712352 0 0 80  0  0  0  0  0  0  0  0 22492 726 1386  0  4 96
0 1 0 12704536 1713928 1 16 32 0  0  0  0  0  0  0  0 26411 7579 3185 1  8 91
0 1 0 12704536 1713928 0 0 32  0  0  0  0  0  0  0  0 25733 702 1381  0  7 93
0 1 0 12704536 1713928 0 0  0  0  0  0  0  0  0  0  0 25875 1428 1906 0  6 94
0 1 0 12704536 1713928 0 0  0  0  0  0  0  0  0  0  0 27016 1110 1698 1  6 93
0 1 0 12704536 1713928 0 0 16  0  0  0  0  2  0  0  2 21600 2436 1862 3  5 92
0 1 0 12704664 1714056 92 626 64 0 0 0  0  0  0  0  0 16056 2016 1599 1  4 95
[/php]

The difference is obvious.

[php]

Under buffered I/O, sy and pi are both higher, and it takes longer

Server>>time dd if=pradb.dbf of=/data/oracle/sprrprd1/data23/steven_zhang.dmp2 bs=256k
7684+1 records in
7684+1 records out

real 0m54.45s
user 0m0.10s
sys    0m51.17s

procs    memory          page          disk       faults    cpu
r b w swap  free  re  mf pi po fr de sr s0 s1 s2 s1 in sy cs us sy id
0 0 0 12689008 1431832 4683 16 36952 0 0 0 0 0 0 0  0 18223 4422 1526 2 19 79
0 0 0 12689008 1431960 4810 205 37912 0 0 0 0 1 0 0 0 17267 1399 977  1 16 84
0 0 0 12689008 1431816 4743 0 37904 0 0 0 0 0 0  0  0 17066 774  757  0 17 82
0 0 0 12688344 1431848 4832 691 37888 0 0 0 0 0 0 0 0 17834 2519 1140 2 17 81
0 0 0 12689008 1431976 4788 17 37816 0 0 0 0 0 0 0  0 18127 2646 1401 3 18 79
0 0 0 12689008 1430704 4805 0 38192 0 0 0 0 0 0  0  0 17442 1197 914  1 15 84
0 0 0 12690632 1432328 4744 5 37632 32 32 0 0 8 0 0 7 17275 1339 810  1 16 83
0 0 0 12693632 1435544 4734 0 37648 16 16 0 0 1 0 0 2 17081 961  740  0 18 82
0 0 0 12693632 1435608 4709 312 37504 56 56 0 0 16 0 0 2 16821 1993 842 0 18 82
0 0 0 12698368 1440112 4701 0 37136 8 8 0 0 1 0  0  1 16750 520  568  0 16 83
0 0 0 12698368 1440120 4695 0 37120 0 0 0 0 0 0  0  0 16675 762  601  0 17 83
0 0 0 12698368 1440120 4674 0 37120 8 8 0 0 1 0  0  1 16634 526  574  0 17 83
0 0 0 12698368 1439936 4576 995 35720 8 8 0 0 4 0 0 3 17448 6559 1988 5 19 76
[/php]
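The two dd runs can be reduced to a throughput figure and a kernel-time share. A rough sketch: `dd_summary` is a hypothetical helper, the size is a lower bound because the partial final record (the "+1") is ignored, and the k in bs is taken as 1024 bytes.

```python
def dd_summary(full_records, block_kb, real_s, sys_s):
    """Summarize one dd run from its record count, bs, and time output."""
    kb = full_records * block_kb  # lower bound: partial last record not counted
    return {
        "min_kb": kb,
        "min_mb_per_s": kb / 1024 / real_s,
        "sys_fraction": sys_s / real_s,  # share of wall time spent in the kernel
    }


# Figures copied from the two dd runs above.
direct = dd_summary(full_records=7451, block_kb=264, real_s=45.67, sys_s=10.47)
buffered = dd_summary(full_records=7684, block_kb=256, real_s=54.45, sys_s=51.17)

for label, run in (("direct", direct), ("buffered", buffered)):
    print(label, run)
```

Both runs copy the same roughly 1.9 GB file. The buffered run spends most of its wall-clock time in the kernel (sys close to real), while the direct run spends under a quarter there.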

Now this can be tested on the database.

Yong Huang · posted 2005-9-4 08:34

Originally posted by 玉面飞龙
[B]db_block_size=8k. When an I/O request is larger than 256K, the system uses direct I/O; otherwise (including exactly 256K) it uses buffered I/O.
...
select /*+ full(agdx) parallel(agdx,16) */ count(*) from very_big_table
...
[/B]

Are you saying that with db_block_size=8k, when db_file_multiblock_read_count (dfmrc) is > 32, the I/O becomes direct, and when dfmrc <= 32, the I/O becomes buffered, without changing anything else? That is, whether the I/O is buffered or direct relies neither on the filesystem mount option (directio) nor on the Oracle parameter filesystemio_options? It just happens auto-magically? How did you verify whether the I/O is direct or buffered?

Secondly, you may want to add table alias agdx after very_big_table, even though it's likely the query indeed ran with full table scan. Does explain plan show that it's full table scan and it's a parallel execution?

By the way, your observations about buffered and direct I/O's are very interesting.

Yong Huang

玉面飞龙 · posted 2005-9-4 14:09

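Yes. Oracle's filesystemio_options is set to async, and no filesystem mount options are set; anything >256K uses direct I/O, according to the SA:

We use Veritas file system (vxfs) which bypasses the cache for reads/write > 256K

Before posting I replaced the real table name with very_big_table and forgot to change the alias agdx. The degree of parallelism in the SQL is 16, but the test environment above has only 8 CPUs (parallel_max_servers=16), which may be why the context switches are so high.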

Yong Huang · posted 2005-9-4 22:13

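Originally posted by 玉面飞龙
[B]Yes. Oracle's filesystemio_options is set to async, and no filesystem mount options are set; anything >256K uses direct I/O, according to the SA.
...
The degree of parallelism in the SQL is 16, but the test environment above has only 8 CPUs (parallel_max_servers=16), which may be why the context switches are so high. [/B]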

You're right. You must be using discovered direct I/O, a feature in VxFS. Since it's Solaris, you may be able to truss -u the server process (shadow process) and look for directio. See the first part of
http://rootshell.be/~yong321/oranotes/DirectIO.txt
for an example. But I'm not positive about these "signatures" because VxFS may have changed them, even though OS is Solaris.

You explain the high context switches with parallel executions. Then can you run it without parallel and see low context switches?

BTW, the uploaded Direct I/O paper must be from
http://www.mgogala.com/directio.pdf
It's written by Mladen Gogala. It would be better if you could find and post the URL, than uploading a full document here. (It's even worse when you do so and you can't remember the author.) Just a friendly reminder.

Yong Huang
