An Optimization of a Hash Join

bluemoon0083 · posted 2008-3-21 13:51

I don't know how to take the dump.

eagle_fan · posted 2008-3-21 13:59

Originally posted by foreverlee on 2008-3-21 13:40

SQL> select count(distinct stcdt) from small_table;

COUNT(DISTINCTSTCDT)
--------------------
1857

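But here the column's number of distinct values is 1857 while the number of buckets is 4096. In principle there are far more baskets than eggs, yet some baskets still hold several eggs, so I'm a bit confused.

Having more baskets than eggs does not guarantee one egg per basket. All I said was that if there are more eggs than baskets, at least one basket must hold more than one egg.

Even when there are fewer eggs than baskets, a basket can still end up with several eggs, because some baskets stay empty. It depends on your data and on Oracle's hash function.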
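
A minimal sketch of the eggs-and-baskets point, assuming an idealized hash that drops each key into one of the buckets uniformly at random (an assumption for illustration, not Oracle's actual hash function). The helper name `log_prob_no_collision` is made up for this example. It computes the log-probability that every key lands in its own bucket.

```python
import math

def log_prob_no_collision(keys, buckets):
    """Natural log of P(all keys land in distinct buckets) under uniform random hashing."""
    if keys > buckets:
        # Pigeonhole principle: more eggs than baskets forces a collision.
        return float("-inf")
    # The i-th key must avoid the i buckets that are already occupied.
    return sum(math.log1p(-i / buckets) for i in range(keys))

# 1857 distinct keys and 4096 buckets, as in the thread.
lp = log_prob_no_collision(1857, 4096)
print(f"log10 P(no collision) = {lp / math.log(10):.1f}")
```

Under this assumption the probability of getting no collision at all is vanishingly small, so seeing buckets with several rows is the expected outcome even though the buckets outnumber the keys.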

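foreverlee · posted 2008-3-21 16:25

Originally posted by sharklove on 2008-3-21 12:01
I just ran a few experiments, and it seems the hash table is not built simply from the hash key values,
because in this experiment different hash keys ended up in the same bucket.

Two tables: small_table has 1857 rows and big_table has 418072 rows.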

SQL> select /*+leading(a) full(a) use_hash(a b)*/ count(b.q)
  2  from small_table a,big_table b
  3  where a.stcdt=b.stcdt;

COUNT(B.Q)
----------
275340

Execution Plan
----------------------------------------------------------
0    SELECT STATEMENT Optimizer=CHOOSE (Cost=551 Card=1 Bytes=15)
1 0 SORT (AGGREGATE)
2 1    HASH JOIN (Cost=551 Card=100336 Bytes=1505040)
3 2    TABLE ACCESS (FULL) OF 'ST_RIVER_R' (Cost=188 Card=418072 Bytes=3762648)
4 2    TABLE ACCESS (FULL) OF 'ST_STINFO_B' (Cost=5 Card=1857 Bytes=11142)

In this experiment the hash key is a primary key, so its values are unique.
SQL> select count(*) from small_table;

  COUNT(*)
----------
   1857

SQL> select count(distinct stcdt) from small_table;

COUNT(DISTINCTSTCDT)
--------------------
            1857

If the OP is right, shouldn't each bucket hold at most one row?

Let's look at the trace:
############# 10104 trace info #############

*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
### Hash table ###
# NOTE: The calculated number of rows in non-empty buckets may be smaller
#    than the true number.
Number of buckets with 0 rows:    2625
Number of buckets with 1 rows:    1152
Number of buckets with 2 rows:       260
Number of buckets with 3 rows:       51
Number of buckets with 4 rows:       8
Number of buckets with 5 rows:       0
Number of buckets with 6 rows:       0
Number of buckets with 7 rows:       0
Number of buckets with 8 rows:       0
Number of buckets with 9 rows:       0
Number of buckets with between  10 and  19 rows:       0
Number of buckets with between  20 and  29 rows:       0
Number of buckets with between  30 and  39 rows:       0
Number of buckets with between  40 and  49 rows:       0
Number of buckets with between  50 and  59 rows:       0
Number of buckets with between  60 and  69 rows:       0
Number of buckets with between  70 and  79 rows:       0
Number of buckets with between  80 and  89 rows:       0
Number of buckets with between  90 and  99 rows:       0
Number of buckets with 100 or more rows:       0
### Hash table overall statistics ###
Total buckets: 4096 Empty buckets: 2625 Non-empty buckets: 1471
Total number of rows: 1857
Maximum number of rows in a bucket: 4
Average number of rows in non-empty buckets: 1.262407

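The trace shows Maximum number of rows in a bucket: 4

Reasoning from sharklove's experiment:
In this experiment the hash key is a primary key, so its values are unique.
Every hash column value in build_table is run through some hash function.

Case analysis:
1 The resulting hash values should be unique (the hash key is a primary key), and since the number of distinct column values < the number of hash buckets,
each hash bucket should in principle correspond to just one row. But that is not what Oracle actually does:
Maximum number of rows in a bucket: 4

2 If Oracle used a cyclic function like mod() as the hash function to map rows to hash buckets,
every hash bucket should receive the same number of rows.
But the distribution here is clearly uneven: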
Number of buckets with 0 rows:    2625
Number of buckets with 1 rows:    1152
Number of buckets with 2 rows:       260
Number of buckets with 3 rows:       51
Number of buckets with 4 rows:       8

So I don't think Oracle's hash function can be described precisely with just
select count(*),col1_name,col2_name,...colN_name from build_table
group by col1_name,col2_name....colN_name
That said, it should do if you only want to spot the general pattern.
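
For comparison, here is a rough sketch of what a purely random assignment would produce, assuming each of the 1857 rows falls into one of the 4096 buckets independently and uniformly (an idealized model, not a description of Oracle's real hash function). The function name `expected_bucket_counts` is hypothetical; the observed figures are copied from the 10104 trace quoted above.

```python
import math

def expected_bucket_counts(rows, buckets, max_k):
    """Expected number of buckets holding exactly k rows, for k = 0..max_k,
    when each row picks a bucket independently and uniformly (binomial model)."""
    p = 1.0 / buckets
    return [
        buckets * math.comb(rows, k) * p**k * (1 - p) ** (rows - k)
        for k in range(max_k + 1)
    ]

# Observed values from the trace: {rows in bucket: number of buckets}.
TRACE = {0: 2625, 1: 1152, 2: 260, 3: 51, 4: 8}

expected = expected_bucket_counts(1857, 4096, 10)
for k, observed in TRACE.items():
    print(f"{k} rows: expected {expected[k]:8.1f}   observed {observed}")
```

Under this model you would expect roughly 2600 empty buckets and roughly 1180 single-row buckets, which is close to the 2625 and 1152 in the trace. So the uneven counts look like what a well-mixing hash gives for unique keys, rather than evidence of anything unusual.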

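foreverlee · posted 2008-3-21 16:29

Originally posted by eagle_fan on 2008-3-21 13:59

Having more baskets than eggs does not guarantee one egg per basket. All I said was that if there are more eggs than baskets, at least one basket must hold more than one egg.

Even when there are fewer eggs than baskets, a basket can still end up with several eggs, because some baskets stay empty. It depends on your data and on Oracle's hash function.

Since the key is a primary key, the data values are unique, so it seems that the hash function Oracle uses for the hash join is the decisive factor in terms of data mapping between build_table rows and hash buckets.

P.S. All of a sudden, I am not able to type Chinese.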

OoNiceDream · posted 2008-3-21 17:33

Learned something here. I still don't really understand SQL tuning, though, so I'll bookmark this and study it properly later.

l3f3f3 · posted 2008-3-22 10:30

Have to bump this~

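sharklove · posted 2008-3-22 14:51

Originally posted by foreverlee on 2008-3-21 16:25

2 If Oracle used a cyclic function like mod() as the hash function to map rows to hash buckets,
every hash bucket should receive the same number of rows.
But the distribution here is clearly uneven: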
Number of buckets with 0 rows:    2625
Number of buckets with 1 rows:    1152
Number of buckets with 2 rows:       260
Number of buckets with 3 rows:       51
Number of buckets with 4 rows:       8

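So I don't think Oracle's hash function can be described precisely with just
select count(*),col1_name,col2_name,...colN_name from build_table
group by col1_name,col2_name....colN_name
That said, it should do if you only want to spot the general pattern.

Generating hash values with something like mod() gives every bucket the same number of rows only in theory (that is, on average in the probabilistic sense).
What actually happens depends on how the hash key values are distributed.
For example:
Suppose Oracle's hash function takes the remainder modulo 10: mod(x,10)
Hash the following values: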
1,
11,
21,
31,
41,
51,
61
You'll find that even though every value is unique, they all end up in a single bucket.
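
The mod(x,10) example can be checked with a small sketch. The helper `bucketize` is made up for this illustration, and neither modulus is claimed to be what Oracle uses. The second call hashes the same keys modulo 7 to show that the outcome depends on both the data and the function.

```python
from collections import defaultdict

def bucketize(keys, hash_fn):
    """Group keys by the bucket number that hash_fn assigns to them."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[hash_fn(key)].append(key)
    return dict(buckets)

keys = [1, 11, 21, 31, 41, 51, 61]

by_mod10 = bucketize(keys, lambda x: x % 10)  # hypothetical hash: mod(x,10)
by_mod7 = bucketize(keys, lambda x: x % 7)    # same keys, different modulus

print(by_mod10)         # {1: [1, 11, 21, 31, 41, 51, 61]}
print(sorted(by_mod7))  # [0, 1, 2, 3, 4, 5, 6]
```

With modulo 10 all seven unique keys pile into bucket 1, while modulo 7 gives each key its own bucket. Unique keys alone say nothing about how evenly the buckets fill.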

This question deserves a closer look.

alanyang001 · posted 2008-4-18 16:27

Thanks for sharing; a very thorough analysis.

xiaodong_1567 · posted 2008-4-21 20:02

Good stuff
Will study it slowly

husthxd · posted 2008-5-8 09:59

marked
