oracle 使用semi-join优化一例

fan_zhen_hua · 发表于 2011-6-20 22:39

blog文章链接地址：

http://www.dbafree.net/?p=224

贴一下文章，如下：

有一个查询，大致情况如下：
<pre>
SELECT COUNT(distinct A.BATCH_ID)
 FROM A, B
WHERE A.机构 = '类别'
AND B.BATCH_ID = A.BATCH_ID;
</pre>
A为批次表，4000条记录，B表为批次明细表，A表每个批次在B中对应5000条记录，所以B表大约在2000万条记录左右。A.机构 = '类别'的批次号大概600个。
上面查询的意思是：统计一下有批次明细的批次号，总共有多少个。
这个查询，我跑了大约10秒没没出来结果。原因很明显，它要先选出A表中XX类别的数据，过滤得到600个批次号，然后用这600个批次号跟B表所有的满足条件的记录做关联（大约600*5000=300万），取得结果后，再统计count( distinct batch_id)。
这个SQL转化一下其实就是下面这个SQL:
<pre>
SELECT COUNT(distinct A.BATCH_ID) FROM (
SELECT A.BATCH_ID
FROM A, B
 WHERE A.机构 = '类别'
AND B.BATCH_ID = A.BATCH_ID
);
</pre>
这里先来说明一下子查询的执行代码(如上红色部分),大致如下:
<pre>
for x in ( select BATCH_ID from A where A.机构 = '类别' )
loop
 for y in ( select BATCH_ID from B)
 loop
 if ( x.BATCH_ID == y.BATCH_ID )
 outPut_Record(x.BATCH_ID)
 End if
 end loop
end loop
</pre>
这个代码如何优化呢？
代码的主要部分是：
<pre>
loop
if ( x.BATCH_ID == y.BATCH_ID )
 OutPut_Record(x.BATCH_ID)
 End if
end loop
</pre>
对于这个查询而言，这个LOOP 循环，我们只要取到至少1条记录就足以判断结果了，所以，优化很容易，只要加个break就可以了。
<pre>
for x in ( select BATCH_ID from A where A.机构 = '类别' )
loop
 for y in ( select BATCH_ID from B)
 loop
 if ( x.BATCH_ID == y.BATCH_ID )
 outPut_Record(x.BATCH_ID)
 break;
 End if
 end loop
end loop
</pre>
--优化后的SQL即为，这个SQL只要判定B表中有没有BATCH_ID就可以了，即只要有一条，就终止执行。
这个优化后的SQL的逻辑读在3000个左右。执行效率已经比较高了。逻辑读个数约为：600*3+600*3=3600逻辑读。SQL语句如下：
<pre>
SELECT COUNT(a.BATCH_ID)
FROM A
 WHERE A.机构 = '类别'
and exists
 (select /*+ nl_sj(A B)*/
 1
 from B
 where b.BATCH_ID = a.BATCH_ID)
</pre>
其中nl_sj(A B)这个hints一般情况下可以不加。

关于semi-join和Anti-Join，大致说明如下：

semi-join（半连接）
Suppose you have the DEPT and EMP tables in the SCOTT schema and you want a list of departments with at least one employee. You could write the query with a conventional join:

<pre> SELECT D.deptno, D.dname
 FROM dept D, emp E
 WHERE E.deptno = D.deptno
 ORDER BY D.deptno;
</pre>
Unfortunately, if a department has 400 employees then that department will appear in the query output 400 times. You could eliminate the duplicate rows by using the DISTINCT keyword, but you would be making Oracle do more work than necessary. Really what you want to do is specify a semi-join between the DEPT and EMP tables instead of a conventional join:
<pre>
 SELECT D.deptno, D.dname
 FROM dept D
 WHERE EXISTS
 (
 SELECT 1
 FROM emp E
 WHERE E.deptno = D.deptno
 )
 ORDER BY D.deptno;
</pre>

多在子查询exists中使用，对外部row source的每个键值，查找到内部row source匹配的第一个键值后就返回，如果找到就不用再查找内部row source其他的键值了。
<pre>
for x in ( select * from dept )
loop
 for y in ( select * from emp)
 loop
 if ( x.deptno == y.deptno )
 OutPut_Record(x.dname)
 Break;
 End if
 end loop
end loop
</pre>

另外，还有一个Anti-Join（反连接）
Suppose you want a list of empty departments—departments with no employees. You could write a query that finds all departments and subtracts off the department of each employee:
<pre>
 SELECT D1.deptno, D1.dname
 FROM dept D1
 MINUS
 SELECT D2.deptno, D2.dname
 FROM dept D2, emp E2
 WHERE D2.deptno = E2.deptno
 ORDER BY 1;
</pre>
The above query will give the desired results, but it might be clearer to write the query using an anti-join:
<pre>
 SELECT D.deptno, D.dname
 FROM dept D
 WHERE NOT EXISTS
 (
 SELECT 1
 FROM emp E
 WHERE E.deptno = D.deptno
 )
 ORDER BY D.deptno;
</pre>

d_an_56 · 发表于 2011-6-22 10:00

友情up

ai2ming · 发表于 2011-6-22 13:05

写的不错。

[原创] oracle 使用semi-join优化一例