受楼下老大的NULL贴启发，讲两个NULL与索引的小技巧

newkid · 发表于 2008-7-12 03:40

1.既然NULL是可以进复合索引的，在我们需要对NULL进行索引时，就可以构造一个“伪复合索引”：

CREATE INDEX my_index ON my_table(my_column,0);

后面这个零就是加入的伪列。这样以后在有 my_column IS NULL 的条件就可以利用索引了（当然最终使用与否还得由CBO决定）。

2.不想索引的行，即使不是NULL, 也可用函数把它剔除。
  假设有status_id列，里面有0:未处理,1:已处理两种状态，我们关心的仅仅是0的行，处理完就会改成1. 这样表中0的行仅仅是少数，大部分是1的行，数据量多了BTREE索引的维护就有开销。
  这时可以建立这样的索引：
  CREATE INDEX my_index ON my_table(DECODE(status_id,0,0));
  它只对0行数据进行索引。当你要取未处理数据时，SELECT * FROM my_table WHERE DECODE(status_id,0,0)=0 就可以高效利用索引。

xsaier · 发表于 2011-5-28 21:39

Newkid,非常感谢,正需要这个清晰的回复。

newkid · 发表于 2011-5-27 22:40

多一个DECODE计算而已, 不算什么负担。

〇〇 · 发表于 2011-5-27 22:28

20楼好像问这种索引，插入表有负担？

newkid · 发表于 2011-5-27 21:56

Indexing Only Some of the Rows
In addition to transparently helping out queries that use built-in functions like UPPER, LOWER, and so on,
function-based indexes can be used to selectively index only some of the rows in a table. As we’ll discuss
a little later, B*Tree indexes do not contain entries for entirely NULL keys. That is, if you have an index I
on a table T
Create index I on t(a,b);
and you have a row where A and B are both NULL, there will be no entry in the index structure. This
comes in handy when you are indexing just some of the rows in a table.
Consider a large table with a NOT NULL column called PROCESSED_FLAG that may take one of two
values, Y or N, with a default value of N. New rows are added with a value of N to signify not processed, and
as they are processed, they are updated to Y to signify processed. We would like to index this column to
be able to retrieve the N records rapidly, but there are millions of rows and almost all of them are going to
have a value of Y. The resulting B*Tree index will be large, and the cost of maintaining it as we update
from N to Y will be high. This table sounds like a candidate for a bitmap index (this is low cardinality, after
all!), but this is a transactional system and lots of people will be inserting records at the same time with
the processed column set to N and, as we discussed earlier, bitmaps are not good for concurrent
modifications. When we factor in the constant updating of N to Y in this table as well, then bitmaps
would be out of the question, as this process would serialize entirely.
So, what we would really like is to index only the records of interest (the N records). We’ll see how to
do this with function-based indexes, but before we do, let’s see what happens if we just use a regular
index. Using the standard BIG_TABLE script described in the setup section, we’ll update the TEMPORARY
column, flipping the Ys to Ns and the Ns to Ys
ops$tkyte@ORA11GR2> update big_table set temporary = decode(temporary,'N','Y','N');
1000000 rows updated.
and we’ll check out the ratio of Ys to Ns
ops$tkyte@ORA11GR2> select temporary, cnt,
2 round( (ratio_to_report(cnt) over ()) * 100, 2 ) rtr
3 from (
4 select temporary, count(*) cnt
5 from big_table
6 group by temporary
7 )
8 /
T CNT RTR
- ---------- ----------
N 1779 .18
Y 998221 99.82
As we can see, of the 1,000,000 records in the table, only about one-fifth of 1 percent of the data
should be indexed. If we use a conventional index on the TEMPORARY column (which is playing the role of
the PROCESSED_FLAG column in this example), we would discover that the index has 1,000,000 entries,
consumes almost 14MB of space, and has a height of 3:
ops$tkyte@ORA11GR2> create index processed_flag_idx
2 on big_table(temporary);
Index created.
ops$tkyte@ORA11GR2> analyze index processed_flag_idx
2 validate structure;
Index analyzed.
ops$tkyte@ORA11GR2> select name, btree_space, lf_rows, height
2 from index_stats;
NAME BTREE_SPACE LF_ROWS HEIGHT
------------------------------ ----------- ---------- ----------
PROCESSED_FLAG_IDX 14528892 1000000 3
Any retrieval via this index would incur three I/Os to get to the leaf blocks. This index is not only
wide, but also tall. To get the first unprocessed record, we will have to perform at least four I/Os (three
against the index and one against the table).
How can we change all of this? We need to make it so the index is much smaller and easier to
maintain (with less runtime overhead during the updates). Enter the function-based index, which
allows us to simply write a function that returns NULL when we don’t want to index a given row and
returns a non-NULL value when we do. For example, since we are interested just in the N records, let’s
index just those:
ops$tkyte@ORA11GR2> drop index processed_flag_idx;
Index dropped.
ops$tkyte@ORA11GR2> create index processed_flag_idx
2 on big_table( case temporary when 'N' then 'N' end );
Index created.
ops$tkyte@ORA11GR2> analyze index processed_flag_idx
2 validate structure;
Index analyzed.
ops$tkyte@ORA11GR2> select name, btree_space, lf_rows, height
2 from index_stats;
NAME BTREE_SPACE LF_ROWS HEIGHT
------------------------------ ----------- ---------- ----------
PROCESSED_FLAG_IDX 40012 1779 2
That is quite a difference—the index is some 40KB, not 14MB. The height has decreased as well. If
we use this index, we’ll perform one less I/O than we would using the previous taller index.

newkid · 发表于 2011-5-27 21:54

TOM的9i 10g 编程艺术：

11.4.3 只对部分行建立索引
基于函数的索引除了对使用内置函数（如UPPER、LOWER 等）的查询显然有帮助之外，还可以用来有
选择地只是对表中的某些行建立索引。稍后会讨论，B*树索引对于完成为NULL 的键没有相应的条目。也就
是说，如果在表T 上有一个索引I：
Create index I on t(a,b);

而且行中A 和B 都为NULL，索引结构中就没有相应的条目。如果只对表中的某些行建立索引，这就
能用得上。
考虑有一个很大的表，其中有一个NOT NULL 列，名为PROCESSED_FLAG，它有两个可取值：Y 或N，
默认值为N。增加新行时，这个值为N，指示这一行未得到处理，等到处理了这一行后，则会将其更新为Y
来指示已处理。我们可能想对这个列建立索引，从而能快速地获取值为N 的记录，但是这里有数百万行，
而且几乎所有行的值都为Y。所得到的B*树索引将会很大，如果我们把值从N 更新为Y，维护这样一个大
索引的开销也相当高。这个表听起来很适合采用位图索引（毕竟基数很低！），但这是一个事务性系统，可
能有很多人在同时插入记录（新记录的“是否处理”列设置为N），前面讨论过，位图索引不适用于并发修
改。如果考虑到这个表中会不断地将N 更新为Y，那位图就更不合适了，根本不应考虑，因为这个过程会
完全串行化。
所以，我们真正想做的是，只对感兴趣的记录建立索引（即该列值为N 的记录）。我们会介绍如何利
用基于函数的索引来做到这一点，但是在此之前，先来看如果只是一个常规索引会发生什么。使用本书最
前面“环境设置”一节中描述的标准BIG_TABLE 脚本，下面更新TEMPORARY 列，在此将Y 变成N，以及N
变成Y：
ops$tkyte@ORA10G> update big_table set temporary = decode(temporary,'N','Y','N');
1000000 rows updated.
现在检查Y 与N 地比例：
ops$tkyte@ORA10G> select temporary, cnt,
2 round( (ratio_to_report(cnt) over ()) * 100, 2 ) rtr
3 from (
4 select temporary, count(*) cnt
5 from big_table
6 group by temporary
7 )
8 /
T CNT RTR
- ---------- ----------
N 1779 .11.
Y 998221 99.82

可以看到，在表的11.000,000 条记录中，只有0.2%的数据应当加索引。如果在TEMPORARY 列上使用
传统索引（相对于这个例子中PROCESSED_FLAG 列的角色），会发现这个索引有11.000,000 个条目，占用了
超过14MB 的空间，其高度为3：

ops$tkyte@ORA10G> create index processed_flag_idx
2 on big_table(temporary);
Index created.
ops$tkyte@ORA10G> analyze index processed_flag_idx
2 validate structure;
Index analyzed.
ops$tkyte@ORA10G> select name, btree_space, lf_rows, height
2 from index_stats;
NAME BTREE_SPACE LF_ROWS HEIGHT
------------------------------ ----------- ----------
----------
PROCESSED_FLAG_IDX 14528892 1000000 3

通过这个索引获取任何数据都会带来3 个I/O 才能达到叶子块。这个索引不仅很“宽”，还很“高”。
要得到第一个未处理的记录，必须至少执行4 个I/O（其中3 个是对索引的I/O，另外一个是对表的I/O）。
怎么改变这种情况呢？我们要让索引更小一些，而且要更易维护（更新期间的运行时开销更少）。采
用基于函数的索引，我们可以编写一个函数，如果不想对某个给定行加索引，则这个函数就返回NULL；而
对想加索引的行则返回一个非NULL 值。例如，由于我们只对列值为N 的记录感兴趣，所以只对这些记录加
索引：
ops$tkyte@ORA10G> drop index processed_flag_idx;
Index dropped.
ops$tkyte@ORA10G> create index processed_flag_idx
2 on big_table( case temporary when 'N' then 'N' end );
Index created.
ops$tkyte@ORA10G> analyze index processed_flag_idx
2 validate structure;
Index analyzed.
ops$tkyte@ORA10G> select name, btree_space, lf_rows, height
2 from index_stats;
NAME BTREE_SPACE LF_ROWS HEIGHT

------------------------------ ----------- ----------
----------
PROCESSED_FLAG_IDX 40012 1779 2
这就有很大不同，这个索引只有大约40KB，而不是11..5MB。高度也有所降低。与前面那个更高的索
引相比，使用这个索引能少执行一个I/O。

xsaier · 发表于 2011-5-27 12:16

Newkid, ,对于你提出的意见.
2.不想索引的行，即使不是NULL, 也可用函数把它剔除。
  假设有status_id列，里面有0:未处理,1:已处理两种状态，我们关心的仅仅是0的行，处理完就会改成1. 这样表中0的行仅仅是少数，大部分是1的行，数据量多了BTREE索引的维护就有开销。
  这时可以建立这样的索引：
  CREATE INDEX my_index ON my_table(DECODE(status_id,0,0));
  它只对0行数据进行索引。当你要取未处理数据时，SELECT * FROM my_table WHERE DECODE(status_id,0,0)=0 就可以高效利用索引。
我发给我们公司dba看了下,他有如下的意见, 不知道你能不能印证下,谢谢.
I’m assuming the theory is to reduce the size of the index by only having the “READY” rows in the index.  This would reduce the size of the index and speed access to rows in the table by reducing I/O.  Have you done any testing to see?

You still have the overhead of the index and the overhead to maintain the index as the values change or rows are inserted.  The number of access to the table would still be the same.  So I can’t believe there is much of a difference.

oraclesea · 发表于 2008-7-14 14:16

不错

wangfans · 发表于 2008-7-14 12:28

原帖由 newkid 于 2008-7-13 21:53 发表

你用了NVL就不等价了，如果 t1.my_column = 0 呢？
还不如在子查询中用 IS NOT NULL.

一般人会认为　2 NOT IN (1, NULL) 是个 TRUE, 但实际上是个 NULL, 在 WHERE 中是被过滤掉的，这就是NULL造成的困惑。

支持

newkid · 发表于 2008-7-13 21:53

原帖由 kisa99 于 2008-7-13 15:57 发表
看了下那个帖子，可能是开始我没理解楼主的意思。不过还是可以用nvl来处理。
SELECT ... FROM t1 WHERE my_column NOT IN (SELECT nvl(my_column,0) FROM t2);

你用了NVL就不等价了，如果 t1.my_column = 0 呢？
还不如在子查询中用 IS NOT NULL.

一般人会认为　2 NOT IN (1, NULL) 是个 TRUE, 但实际上是个 NULL, 在 WHERE 中是被过滤掉的，这就是NULL造成的困惑。