[Translation] Jonathan Lewis's series of articles on histograms (post #33 updated with part 3)

newkid · posted 2014-1-28 01:02

Jonathan Lewis replied by email:
James,

I've corrected the article and added a comment about it.

The division should have been: "(number of non-popular rows)/(number of non-popular values)"
And then the arithmetic should have been: 12/7 = 1.714

Regards

Jonathan Lewis
http://jonathanlewis.wordpress.com/all-postings

I'll revise the translation accordingly in a moment.
Thanks to oracledbacrs for being so meticulous!
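
For anyone who wants to check the corrected arithmetic from the email, here is a minimal sketch. The function name is made up for illustration, and the inputs 12 and 7 are simply the figures quoted in the email.

```python
def rows_per_non_popular_value(non_popular_rows, non_popular_values):
    """Corrected division from the email:
    (number of non-popular rows) / (number of non-popular values)."""
    return non_popular_rows / non_popular_values


# Figures quoted in the email: 12 non-popular rows, 7 non-popular values.
result = rows_per_non_popular_value(12, 7)
print(round(result, 3))  # 1.714
```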

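oracledbacrs · posted 2014-2-8 15:31

Sorry, I've been studying the 11g sampling algorithm recently and came back to reread this article, and I found another small issue. Leaving 10g and 9i aside,
in part 2 the author writes:
We have ten million rows of data, and the first thing we do to create a height-balanced histogram is sort the data, so the first step is quite resource-intensive. If we sample the data, that reduces the amount of data to sort, but the sample may miss so many values that Oracle thinks it can create a frequency histogram, which would make the optimizer badly underestimate some values that actually have thousands of rows.

My question: since Oracle 11g's new NDV algorithm can accurately estimate a column's NDV before histogram information is gathered, why would the author say "so many values that Oracle thinks it can create a frequency histogram"? Please explain. As I see it, once an accurate NDV is available, Oracle can quickly pick the best histogram type. Suppose the estimated NDV is 300, clearly > 254. Would it really still try to create a frequency histogram, as the author describes?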

oracledbacrs · posted 2014-2-8 15:37

By the way, does the OP have an article that explains the histogram gathering process in detail? A link would do. Thanks.

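newkid · posted 2014-2-10 23:55

oracledbacrs posted 2014-2-8 15:31
Sorry, I've been studying the 11g sampling algorithm recently and came back to reread this article, and I found another small issue. Leaving 10g and 9i aside,
in part 2 the author ...

That passage is about the sampling case. If your data distribution is very skewed and the sampled portion happens to cover only a few of the values, then even the NDV approach cannot correctly estimate the whole table. If you set the sample percentage to 100%, of course that won't happen.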
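
To illustrate the point about sampling, here is a rough Python sketch. It is not Oracle's algorithm; it only counts distinct values in a random sample of a made-up skewed column. The column layout (200 popular values, 100 values that occur once), the 1% sample and the seed are all assumptions chosen for illustration. The 254 limit is the bucket limit mentioned above.

```python
import random

MAX_BUCKETS = 254  # histogram bucket limit discussed in this thread


def build_skewed_column(popular=200, rows_per_popular=500, rare=100):
    """Hypothetical column: a few hundred popular values plus rare singletons."""
    rows = [v for v in range(popular) for _ in range(rows_per_popular)]
    rows += list(range(popular, popular + rare))  # each rare value appears once
    return rows


def ndv(rows):
    """Exact number of distinct values in the rows supplied."""
    return len(set(rows))


rows = build_skewed_column()
rng = random.Random(42)                      # fixed seed for repeatability
sample = rng.sample(rows, len(rows) // 100)  # roughly a 1% row sample

full_ndv = ndv(rows)
sample_ndv = ndv(sample)
print("NDV in full table:", full_ndv)
print("NDV seen in sample:", sample_ndv)
print("frequency histogram looks feasible from the sample:",
      sample_ndv <= MAX_BUCKETS)
```

The full column has 300 distinct values, more than 254, but a 1% sample is expected to hit only about one of the 100 singleton values, so a decision based on the sample alone would see far fewer distinct values than really exist.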

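I haven't seen the kind of article you're asking for either. If you feel like digging, you could decompile the DBMS_STATS package, or trace the SQL it generates.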

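oracledbacrs · posted 2014-2-11 10:34

newkid posted 2014-2-10 23:55
That passage is about the sampling case. If your data distribution is very skewed and the sampled portion happens to cover only a few of the values, then even the NDV approach cannot ...

Really? In 11g, when table statistics are gathered, all the basic column statistics other than histograms are gathered at the same time, and that pass is an unsampled full table scan. So by the time histogram statistics are gathered, Oracle already knows the column's exact NDV. If you want the article, I can give you a link.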

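oracledbacrs · posted 2014-2-11 10:44

oracledbacrs posted 2014-2-11 10:34
Really? In 11g, when table statistics are gathered, all the basic column statistics other than histograms are gathered at the same time, and that pass ...

One addition: the new NDV algorithm requires a full table scan.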

newkid · posted 2014-2-11 11:06

http://jonathanlewis.files.wordp ... stinct-sampling.pdf
3.3 Estimating table level NDV from sample

So it is not necessarily a full table scan; the table-level figures are still estimated.

oracledbacrs · posted 2014-2-11 12:08

newkid posted 2014-2-11 11:06
http://jonathanlewis.files.wordpress.com/2011/12/one-pass-distinct-sampling.pdf
3.3 Estimating tabl ...

In Oracle Database 11g, we use a completely different approach for gathering basic column statistics. We issue the following query to gather basic column statistics (again this is a simplified version for illustration purpose).

Query 2: Query Gathering Basic Column Statistics Using AUTO_SAMPLE_SIZE in 11g

You will notice in the new basic column statistics gathering query, no sampling clause is used. Instead we do a full table scan. Also, there is no more count(distinct C1) to gather NDV for C1. Instead, during the execution we inject a special statistics gathering row source to this query. The special gathering row source uses a one-pass, hash-based distinct algorithm to gather NDV. More information on how this algorithm works can be found in the paper, “efficient and scalable statistics gathering for large databases in Oracle 11g”. The algorithm requires a full scan of the data, uses a bounded amount of memory and yields a highly accurate NDV that is nearly identical to a 100 percent sampling (can be proven mathematically). The special statistics gathering row source also gathers the number of rows, number of nulls and average column length on the side. Since we do a full scan on the table, the number of rows, average column length, minimal and maximal values are 100% accurate.
OK, I'll study the one you posted. Thanks.
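
The "one-pass, hash-based distinct algorithm" with bounded memory described in the quoted passage can be sketched roughly as below. This is a simplified illustration of the general idea: keep a bounded set of hash values and halve the accepted hash domain whenever the set overflows. It is not Oracle's actual implementation. The function name, the use of MD5 as the hash and the synopsis size of 1024 are all assumptions.

```python
import hashlib


def approx_ndv(values, max_synopsis=1024):
    """One pass over the data, keeping at most max_synopsis hash values.

    Whenever the synopsis overflows, the accepted part of the hash domain
    is halved and the hashes that fall outside it are discarded.
    """
    synopsis = set()
    level = 0  # number of times the hash domain has been halved
    for v in values:
        digest = hashlib.md5(str(v).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        # Accept only hashes whose lowest `level` bits are all zero.
        if h & ((1 << level) - 1):
            continue
        synopsis.add(h)
        while len(synopsis) > max_synopsis:
            level += 1
            mask = (1 << level) - 1
            synopsis = {x for x in synopsis if not x & mask}
    # Each surviving hash stands for about 2**level distinct values.
    return len(synopsis) * (1 << level)


# Fewer distinct values than the synopsis can hold: the count is exact.
print(approx_ndv(i % 1000 for i in range(10_000)))

# Many more distinct values than the synopsis size: the result is an estimate.
print(approx_ndv(range(50_000)))
```

Every row is still read once, as in a full scan, but memory stays bounded by the synopsis size. That matches the properties the quoted passage attributes to the 11g row source.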

oracledbacrs · posted 2014-2-11 12:27

newkid posted 2014-2-11 11:06
http://jonathanlewis.files.wordpress.com/2011/12/one-pass-distinct-sampling.pdf
3.3 Estimating tabl ...

Hey, the link you gave won't download. Could you send the file straight to my email, oracledbacrs@hotmail.com? Thanks.

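newkid · posted 2014-2-12 00:27

The passage you quoted describes the case where AUTO_SAMPLE_SIZE is used.
Also, the passage quoted in post #49 describes the sort-based method, not the new NDV method.
I can't send anything out from the office; I'll upload the paper once I get home.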
