对Hash Join的一次优化

kkngisn · 发表于 2008-3-21 10:35

原帖由 eagle_fan 于 2008-3-21 10:29 发表

oracle通过hash函数生成值hash(col_name) ，然后根据hash(col_name)投放相应的bucket

所以col_name值相同肯定会在同一个bucket里面，col_name值不同也可能出现在一个bucket里面，因为当col_name number of distinct value > number of buckets是，难免会出现一个篮子里面有多个鸡蛋的情况

觉得有点粗糙，能否发份官方文档？

luckysea · 发表于 2008-3-21 10:44

不错

wang19840829 · 发表于 2008-3-21 11:14

经典,

,我还有很多路要走啊!!!!!!!!!!!!!!!!!!!!!!

sharklove · 发表于 2008-3-21 12:01

刚刚做了几个实验，感觉hash table的建立并不是简单的根据hash key的值，
因为这个实验中，发现不同的hash key居然可以放在一个bucket中，

两张表，small_table 有1857条记录，big_table有418072条记录

SQL> select /*+leading(a) full(a) use_hash(a b)*/ count(b.q)
  2  from small_table a,big_table b
  3  where a.stcdt=b.stcdt;

COUNT(B.Q)
----------
275340

Execution Plan
----------------------------------------------------------
0    SELECT STATEMENT Optimizer=CHOOSE (Cost=551 Card=1 Bytes=15)
1 0 SORT (AGGREGATE)
2 1    HASH JOIN (Cost=551 Card=100336 Bytes=1505040)
3 2    TABLE ACCESS (FULL) OF 'ST_RIVER_R' (Cost=188 Card=418072 Bytes=3762648)
4 2    TABLE ACCESS (FULL) OF 'ST_STINFO_B' (Cost=5 Card=1857 Bytes=11142)

在这个实验中，hash key是一个primary key，是唯一值。
SQL> select count(*) from small_table;

  COUNT(*)
----------
   1857

SQL> select count(distinct stcdt) from small_table;

COUNT(DISTINCTSTCDT)
--------------------
            1857

如果按照lz的说法，每个bucket中是不是最多只有一条记录呢？

看看trace信息
############# 10104 trace info #############

*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
### Hash table ###
# NOTE: The calculated number of rows in non-empty buckets may be smaller
#    than the true number.
Number of buckets with 0 rows:    2625
Number of buckets with 1 rows:    1152
Number of buckets with 2 rows:       260
Number of buckets with 3 rows:       51
Number of buckets with 4 rows:       8
Number of buckets with 5 rows:       0
Number of buckets with 6 rows:       0
Number of buckets with 7 rows:       0
Number of buckets with 8 rows:       0
Number of buckets with 9 rows:       0
Number of buckets with between  10 and  19 rows:       0
Number of buckets with between  20 and  29 rows:       0
Number of buckets with between  30 and  39 rows:       0
Number of buckets with between  40 and  49 rows:       0
Number of buckets with between  50 and  59 rows:       0
Number of buckets with between  60 and  69 rows:       0
Number of buckets with between  70 and  79 rows:       0
Number of buckets with between  80 and  89 rows:       0
Number of buckets with between  90 and  99 rows:       0
Number of buckets with 100 or more rows:       0
### Hash table overall statistics ###
Total buckets: 4096 Empty buckets: 2625 Non-empty buckets: 1471
Total number of rows: 1857
Maximum number of rows in a bucket: 4
Average number of rows in non-empty buckets: 1.262407

从trace信息中看到Maximum number of rows in a bucket: 4

d.c.b.a · 发表于 2008-3-21 12:14

大家都学会dump了

sharklove · 发表于 2008-3-21 12:30

不知道oracle是通过什么算法对hash key进行散列的？
对不同的数据类型是使用不同的算法，还是有通用的算法？

anlinew · 发表于 2008-3-21 12:49

原帖由 d.c.b.a 于 2008-3-21 12:14 发表
大家都学会dump了

不会呢还....

eagle_fan · 发表于 2008-3-21 12:52

原帖由 sharklove 于 2008-3-21 12:01 发表
刚刚做了几个实验，感觉hash table的建立并不是简单的根据hash key的值，
因为这个实验中，发现不同的hash key居然可以放在一个bucket中，

两张表，small_table 有1857条记录，big_table有418072条记录

SQL> select /*+leading(a) full(a) use_hash(a b)*/ count(b.q)
  2  from small_table a,big_table b
  3  where a.stcdt=b.stcdt;

COUNT(B.Q)
----------
275340

Execution Plan
----------------------------------------------------------
0    SELECT STATEMENT Optimizer=CHOOSE (Cost=551 Card=1 Bytes=15)
1 0 SORT (AGGREGATE)
2 1    HASH JOIN (Cost=551 Card=100336 Bytes=1505040)
3 2    TABLE ACCESS (FULL) OF 'ST_RIVER_R' (Cost=188 Card=418072 Bytes=3762648)
4 2    TABLE ACCESS (FULL) OF 'ST_STINFO_B' (Cost=5 Card=1857 Bytes=11142)

在这个实验中，hash key是一个primary key，是唯一值。
SQL> select count(*) from small_table;

  COUNT(*)
----------
   1857

SQL> select count(distinct stcdt) from small_table;

COUNT(DISTINCTSTCDT)
--------------------
            1857

如果按照lz的说法，每个bucket中是不是最多只有一条记录呢？

看看trace信息
############# 10104 trace info #############

*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
### Hash table ###
# NOTE: The calculated number of rows in non-empty buckets may be smaller
#    than the true number.
Number of buckets with 0 rows:    2625
Number of buckets with 1 rows:    1152
Number of buckets with 2 rows:       260
Number of buckets with 3 rows:       51
Number of buckets with 4 rows:       8
Number of buckets with 5 rows:       0
Number of buckets with 6 rows:       0
Number of buckets with 7 rows:       0
Number of buckets with 8 rows:       0
Number of buckets with 9 rows:       0
Number of buckets with between  10 and  19 rows:       0
Number of buckets with between  20 and  29 rows:       0
Number of buckets with between  30 and  39 rows:       0
Number of buckets with between  40 and  49 rows:       0
Number of buckets with between  50 and  59 rows:       0
Number of buckets with between  60 and  69 rows:       0
Number of buckets with between  70 and  79 rows:       0
Number of buckets with between  80 and  89 rows:       0
Number of buckets with between  90 and  99 rows:       0
Number of buckets with 100 or more rows:       0
### Hash table overall statistics ###
Total buckets: 4096 Empty buckets: 2625 Non-empty buckets: 1471
Total number of rows: 1857
Maximum number of rows in a bucket: 4
Average number of rows in non-empty buckets: 1.262407

从trace信息中看到Maximum number of rows in a bucket: 4

你没看清我的回复啊：

“col_name值不同也可能出现在一个bucket里面，因为当col_name number of distinct value > number of buckets是，难免会出现一个篮子里面有多个鸡蛋的情况”

一个大集合到小集合的映射当然会出现N->1的情况

foreverlee · 发表于 2008-3-21 13:40

原帖由 eagle_fan 于 2008-3-21 12:52 发表

你没看清我的回复啊：

“col_name值不同也可能出现在一个bucket里面，因为当col_name number of distinct value > number of buckets是，难免会出现一个篮子里面有多个鸡蛋的情况”

一个大集合到小集合的映射当然会出现N->1的情况

SQL> select count(distinct stcdt) from small_table;

COUNT(DISTINCTSTCDT)
--------------------
1857

但是这里
col_name number of distinct value = 1857 而number of buckets是4096 按理说现在篮子多出来很多但是还是出现了一个篮子装多个鸡蛋的情况所以有些confused.

sharklove · 发表于 2008-3-21 13:49

原帖由 foreverlee 于 2008-3-21 13:40 发表

SQL> select count(distinct stcdt) from small_table;

COUNT(DISTINCTSTCDT)
--------------------
1857

但是这里
col_name number of distinct value = 1857 而number of buckets是4096 按理说现在篮子多出来很多但是还是出现了一个篮子装多个鸡蛋的情况所以有些confused.

我想oracle的hash算法应该是某种类似mod()的算法，不同的值出现在一个bucket里很正常，但绝不是因为因为distinct值大于buckets数目时，才会把不同值放在一个bucket中

[精华] 对Hash Join的一次优化