索引的群集因子问题

jsy_0711 · 发表于 2007-9-24 10:52

前天看两本书上介绍索引的群集因子，第一本说：群集因子和索引的leaf_blocks数越接近越好。第二本就成了：群集因子和表的块数越接近越好。我拿一个表看了一下，发现leaf_blocks和blocks数量差得很远，两本书都是Oracle Press系列的，解释为啥不一样呢？

warehouse · 发表于 2007-9-24 11:03

索引的群集因子是指：举个例子吧，感觉不太好描述
比如index column字段是name，name中有个values是a,假如a有100行，那么这100行的a可能存储在1个 data block上也可能存储在10 个data block上，当然更糟糕的情况是存在100个 data block上，这样当通过index scan a 时想想哪种情况好，当然存在1个data block上最好，因为scan的data block少啊，我们把存在1 个 data block上的这种情况称为index的群集因子低，存在100个data block上的称为索引的群集因子高，因此索引的群集因子也是相对而言的！

ihekoko · 发表于 2007-9-24 11:04

cost =
blevel +
ceiling(leaf_blocks * effective index selectivity) +
ceiling(clustering_factor * effective table selectivity)

Historically it has been common practice to say that a good index has a low clustering_factor, and a bad
index has a high clustering_factor.
There is obviously a degree of truth in this comment, especially in the light of what the clustering_factor
represents. However, I have always had an aversion to words like low, high, small, large, and expressions like close
to zero, when talking about Oracle. After all, is 10,000 a low clustering_factor or a high clustering_
factor? It’s low if you have 10,000 blocks in your table, and high if you have 100 blocks in your table. So you
might want to write a couple of little scripts that join user_tables to user_indexes (and other scripts for
partitioned tables, etc.) so that you can compare the critical figures.
In fact, for reasons I describe in Chapter 5, I often use the column avg_data_blocks_per_key to get
an idea of how good Oracle thinks the index is.

一个良好cluster的index的clustering_factor应该是越接近table的block数越好

ihekoko · 发表于 2007-9-24 11:13

In Figure 4-1, we have a table with four blocks and 20 rows, and an index on the column V1,
whose values are shown. If you start to walk across the bottom of the index, the first rowid
points to the third row in the first block. We haven’t visited any blocks yet, so this is a new
block, so we count 1. Take one step along the index, and the rowid points to the fourth row of
the second block—we’ve changed block, so increment the count. Take one step along the
index, and the rowid points to the second row of the first block—we’ve changed block again, so
increment the count again. Take one step along the index, and the rowid points to the fifth row
of the first block—we haven’t changed blocks, so don’t increment the count.
In the diagram, I have put a number against each row of the table—this is to show the
value of the counter as the walk gets to that row. By the time we get to the end of the index, we
have changed table blocks ten times, so the clustering factor is 10.
Notice how small clumps of data stop the clustering_factor from growing—look at block 2
where the value 8 appears four times because four consecutive entries in the index point to the
same block; the same effect shows up in block 3 to give three rows the value 6.
The table doesn’t have to be completely sorted for this type of thing to happen; it only
needs to have little clumps (or clusters) of rows that are nearly sorted—hence the term
clustering_factor, rather than sort_factor.
Given the way the clustering_factor is calculated, you will appreciate that the smallest
possible value has to be the same as the number of blocks in the table, and the largest possible
value has to be the same as the number of rows in the table—provided you have computed
statistics.
If there are lots of blocks like block 2 in the table, the clustering_factor will turn out to be
quite close to the number of blocks in the table, but if the data in the table is randomly scattered,
the clustering_factor will tend to come out close to the number of rows in the table.

warehouse · 发表于 2007-9-24 11:18

TO 楼上的朋友：
有没有对effective index selectivity的详细英文doc，或者给个你的这几个篇文章的链接

jsy_0711 · 发表于 2007-9-24 11:22

嗯嗯，感谢楼上的解答，说和leaf_blocks相近的就是<Oracle9i性能调整>这本书，itpub第二本还推荐过，它写错了？

ora110 · 发表于 2007-9-24 11:22

一个良好cluster的index的clustering_factor应该是越接近table的block数越好

凭感觉表块大些，所以赞同。

ihekoko · 发表于 2007-9-24 11:31

to :
warehouse
Cost Based Oracle Fundamental

to :
jsy_0711
这本书并不是像你讲的那样

when talking about Oracle. After all, is 10,000 a low clustering_factor or a high clustering_
factor? It’s low if you have 10,000 blocks in your table, and high if you have 100 blocks in your table.

注意这里是拿clustering_factor和table blocks向比
并没提到leaf_blocks

冰冻奶茶 · 发表于 2007-9-24 11:37

clustering_factor反映了在物理块上表中记录的连续性
如果块上的记录是按顺序的，而不是杂乱的
那么通过索引访问起来效率会高得多

ihekoko · 发表于 2007-9-24 12:24

测试：
CREATE TABLE t1 AS SELECT * FROM dba_objects;
CREATE INDEX idx_t1 ON t1(object_type);

blevel  : 1
leaf_blocks  : 95
clustering_factor  : 1250
distinct_keys :34

[PHP]
SQL> SELECT * FROM t1 WHERE object_type='VIEW';

Execution Plan
----------------------------------------------------------
0    SELECT STATEMENT Optimizer=CHOOSE (Cost=40 Card=1025 Bytes=8
      8150)

1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T1' (Cost=40 Card=1025 B
      ytes=88150)

2 1    INDEX (RANGE SCAN) OF 'IDX_T1' (NON-UNIQUE) (Cost=3 Card
      =1025)
[/PHP]
用autotrace  的cost是40
[PHP]
Table stats Table: T1 Alias: T1
  TOTAL ::  CDN: 34833  NBLKS:  492  AVG_ROW_LEN:  96
-- Index stats
  INDEX NAME: IDX_T1  COL#: 6
TOTAL ::  LVLS: 1 #LB: 95  #DK: 34  LB/K: 2  DB/K: 36  CLUF: 1250
_OPTIMIZER_PERCENT_PARALLEL = 0
***************************************
SINGLE TABLE ACCESS PATH
Column: OBJECT_TYP  Col#: 6    Table: T1 Alias: T1
NDV: 34       NULLS: 0       DENS: 2.9412e-002
NO HISTOGRAM: #BKT: 1 #VAL: 2
  TABLE: T1    ORIG CDN: 34833  ROUNDED CDN: 1025  CMPTD CDN: 1025
  Access path: tsc  Resc:  49  Resp:  49
  Access path: index (equal)
   Index: IDX_T1
  TABLE: T1
   RSC_CPU: 0 RSC_IO: 40
  IX_SEL:  0.0000e+000  TB_SEL:  2.9412e-002
  BEST_CST: 40.00  PATH: 4  Degree:  1
[/PHP]
验证下怎么算出的40
cost =
blevel +
ceiling(leaf_blocks * effective index selectivity) +
ceiling(clustering_factor * effective table selectivity)

ceiling(leaf_blocks * effective index selectivity) = ceil(95*1/34) = ceil(2.79) = 3

ceiling(clustering_factor * effective table selectivity) = ceil(1250*1/34) = ceil(36.76) = 37

blevel = 1 (blevel为1时算为0)

cost =
0+ 3 + 37 = 40