【赢IPAD+1000元】PostgreSQL的互联网+特性如何跟业务更好的结合｜征文活动

wanzai01 · 发表于 2015-6-11 16:37

为什么数据库会膨胀，以及如何预防？
经常看到有人说表又膨胀了，那么导致对象膨胀的常见原因有哪些呢？

1. 未开启autovacuum
对于未开启autovacuum的用户，同时又没有合理的自定义vacuum调度的话，表的垃圾没有及时回收，新的数据又不断进来，膨胀是必然的。（新的数据包括插入和更新，更新产生新版本的记录）

2. 开启了autovacuum, 但是各种原因导致回收不及时，并且新的数据又不断产生，从而导致膨胀。
回收不及时的原因：
2.1. IO差
当数据库非常繁忙时，如果IO比较差，会导致回收垃圾变慢，从而导致膨胀。
这种一般出现在数据库中存在非常巨大的表，并且这些表在执行whole table vacuum (prevent xid wrapped, 或当表的年龄大于vacuum_freeze_table_age时会全表扫)，因此产生大量IO，这期间很容易导致自身或其他表膨胀。

2.2. autovacuum触发较迟
什么情况会触发autovacuum呢?
* A table needs to be vacuumed if the number of dead tuples exceeds a
* threshold.  This threshold is calculated as
*
* threshold = vac_base_thresh + vac_scale_factor * reltuples
如果没有设置表级别的autovacuum thresh和factor,那么默认使用参数文件配置的值。如下：
int                   autovacuum_vac_thresh;  // 默认50
double       autovacuum_vac_scale;  // 默认0.2
也就是说dead tuple达到约为表的20%时，才触发autovacuum。
然后回收又需要一定的时间，所以最终表的膨胀应该是超过20%的。

2.3. 所有worker繁忙，某些表产生的垃圾如果超过阈值，但是在此期间没有worker可以为它处理垃圾回收的事情。导致可能发生膨胀。
可fork的worker进程个数是参数autovacuum_max_workers决定的，初始化autovacuum共享内存时已固定了它的最大进程数。见代码，
src/backend/postmaster/autovacuum.c
/*
* AutoVacuumShmemInit
*             Allocate and initialize autovacuum-related shared memory
*/
void
AutoVacuumShmemInit(void)
{
      bool          found;

      AutoVacuumShmem = (AutoVacuumShmemStruct *)
            ShmemInitStruct("AutoVacuum Data",
                                             AutoVacuumShmemSize(),
                                             &found);

      if (!IsUnderPostmaster)
      {
            WorkerInfo    worker;
            int                   i;

            Assert(!found);

            AutoVacuumShmem->av_launcherpid = 0;
            dlist_init(&AutoVacuumShmem->av_freeWorkers);
            dlist_init(&AutoVacuumShmem->av_runningWorkers);
            AutoVacuumShmem->av_startingWorker = NULL;

            worker = (WorkerInfo) ((char *) AutoVacuumShmem +
                                                         MAXALIGN(sizeof(AutoVacuumShmemStruct)));

            /* initialize the WorkerInfo free list */
            for (i = 0; i < autovacuum_max_workers; i++)
                     dlist_push_head(&AutoVacuumShmem->av_freeWorkers,
                                                      &worker[i].wi_links);
      }
      else
            Assert(found);
}
如果数据库的表很多，而且都比较大，那么当需要vacuum的表超过了配置autovacuum_max_workers的数量，某些表就要等待空闲的worker。这个阶段就容易出现表的膨胀。
以前的PostgreSQL版本，一个数据库同一时间只会起一个worker进程，现在的版本已经没有这个限制了：
src/backend/postmaster/autovacuum.c
* Note that there can be more than one worker in a database concurrently.
* They will store the table they are currently vacuuming in shared memory, so
* that other workers avoid being blocked waiting for the vacuum lock for that
* table.  They will also reload the pgstats data just before vacuuming each
* table, to avoid vacuuming a table that was just finished being vacuumed by
* another worker and thus is no longer noted in shared memory.  However,
* there is a window (caused by pgstat delay) on which a worker may choose a
* table that was already vacuumed; this is a bug in the current design.
所以如果你的PostgreSQL集群有很多数据库的话，可能需要更多的worker进程来支撑。
另外需要注意一点，worker进程在工作时，每个worker最多会消耗的内存由以下参数决定：
#maintenance_work_mem = 64MB          # min 1MB
#autovacuum_work_mem = -1             # min 1MB, or -1 to use maintenance_work_mem
所以worker进程越多，内存需求量也越大。

2.4. 数据库中存在持有事务Exclusive锁的长事务
PostgreSQL目前存在一个非常严重的缺陷，当数据库中存在持有事务Exclusive锁的长事务，事务过程中产生垃圾的话，无法回收，导致数据库膨胀。
原因见：
src/backend/utils/time/tqual.c
/*
* HeapTupleSatisfiesVacuum
*
*    Determine the status of tuples for VACUUM purposes.  Here, what
*    we mainly want to know is if a tuple is potentially visible to *any*
*    running transaction.  If so, it can't be removed yet by VACUUM.
*
* OldestXmin is a cutoff XID (obtained from GetOldestXmin()).  Tuples
* deleted by XIDs >= OldestXmin are deemed "recently dead"; they might
* still be visible to some open transaction, so we can't remove them,
* even if we see that the deleting transaction has committed.
*/
HTSV_Result
HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
                                             Buffer buffer)
{
后面通过测试来展示。

2.5. 开启了autovacuum_vacuum_cost_delay。
在开启了autovacuum_vacuum_cost_delay后，会使用基于成本的垃圾回收，这个可以有利于降低VACUUM带来的IO影响，但是对于IO没有问题的系统，就没有必要开启autovacuum_vacuum_cost_delay，因为这会使得垃圾回收的时间变长。
当autovacuum进程达到autovacuum_vacuum_cost_limit后，会延迟autovacuum_vacuum_cost_delay后继续。
      /*
      * Adjust cost limit of each active worker to balance the total of cost
      * limit to autovacuum_vacuum_cost_limit.
      */
      cost_avail = (double) vac_cost_limit / vac_cost_delay;
      dlist_foreach(iter, &AutoVacuumShmem->av_runningWorkers)
      {
            WorkerInfo    worker = dlist_container(WorkerInfoData, wi_links, iter.cur);

            if (worker->wi_proc != NULL &&
                     worker->wi_dobalance &&
                     worker->wi_cost_limit_base > 0 && worker->wi_cost_delay > 0)
            {
                     int                   limit = (int)
                     (cost_avail * worker->wi_cost_limit_base / cost_total);

                     /*
                     * We put a lower bound of 1 on the cost_limit, to avoid division-
                     * by-zero in the vacuum code.  Also, in case of roundoff trouble
                     * in these calculations, let's be sure we don't ever set
                     * cost_limit to more than the base value.
                     */
                     worker->wi_cost_limit = Max(Min(limit,
                                                                                    worker->wi_cost_limit_base),
                                                                              1);
            }
限制计算方法由另外几个参数决定：
包括在SHARED BUFFER中命中的块，未命中的块，非脏块的额外成本。
vacuum_cost_page_hit (integer)
The estimated cost for vacuuming a buffer found in the shared buffer cache. It represents the cost to lock the buffer pool, lookup the shared hash table and scan the content of the page. The default value is one.

vacuum_cost_page_miss (integer)
The estimated cost for vacuuming a buffer that has to be read from disk. This represents the effort to lock the buffer pool, lookup the shared hash table, read the desired block in from the disk and scan its content. The default value is 10.

vacuum_cost_page_dirty (integer)
The estimated cost charged when vacuum modifies a block that was previously clean. It represents the extra I/O required to flush the dirty block out to disk again. The default value is 20.
对于IO没有问题的系统，不建议设置autovacuum_vacuum_cost_limit。

2.6.autovacuum launcher process 唤醒时间太长
唤醒时间由参数autovacuum_naptime决定，autovacuum launcher进程负责告诉postmaster需要fork worker进程来进行垃圾回收，但是如果autovacuum launcher进程一直在睡觉的话，那完蛋了，有垃圾了它还在睡觉，那不就等着膨胀吗？
另外还有一个限制在代码中，也就是说不能小于MIN_AUTOVAC_SLEEPTIME 100毫秒：
src/backend/postmaster/autovacuum.c
/* the minimum allowed time between two awakenings of the launcher */
#define MIN_AUTOVAC_SLEEPTIME 100.0             /* milliseconds */
......
      /* The smallest time we'll allow the launcher to sleep. */
      if (nap->tv_sec <= 0 && nap->tv_usec <= MIN_AUTOVAC_SLEEPTIME * 1000)
      {
            nap->tv_sec = 0;
            nap->tv_usec = MIN_AUTOVAC_SLEEPTIME * 1000;
      }
......
            /*
               * Wait until naptime expires or we get some type of signal (all the
               * signal handlers will wake us by calling SetLatch).
               */
            rc = WaitLatch(&MyProc->procLatch,
                                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                       (nap.tv_sec * 1000L) + (nap.tv_usec / 1000L));
这个后面我会进行测试来展示它。

2.7 批量删除或批量更新，
例如对于一个10GB的表，一条SQL或一个事务中删除或更新9GB的数据，这9GB的数据必须在事务结束后才能进行垃圾回收，无形中增加了膨胀的可能。

2.8 大量的非HOT更新，会导致索引膨胀，对于BTREE索引来说，整个索引页没有任何引用才能被回收利用，因此索引比较容易膨胀。

[测试]
测试过程使用如下参数：
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 10
autovacuum_naptime = 1
autovacuum_vacuum_threshold = 5
autovacuum_analyze_threshold = 5
autovacuum_vacuum_scale_factor = 0.002
autovacuum_analyze_scale_factor = 0.001
autovacuum_vacuum_cost_delay = 0
测试数据：
postgres=# create table tbl (id int primary key, info text, crt_time timestamp);
CREATE TABLE
postgres=# insert into tbl select generate_series(1,2000000),md5(random()::text),clock_timestamp();
INSERT 0 2000000
postgres=# \dt+ tbl
                  List of relations
Schema | Name | Type  |  Owner |  Size  | Description
--------+------+-------+----------+--------+-------------
public | tbl  | table | postgres | 146 MB |
(1 row)

postgres=# \di+ tbl_pkey
                     List of relations
Schema | Name | Type  |  Owner | Table | Size  | Description
--------+----------+-------+----------+-------+-------+-------------
public | tbl_pkey | index | postgres | tbl | 43 MB |
(1 row)
测试脚本：
一次最多更新25万条
$ vi test.sql
\setrandom id 1 2000000
update tbl set info=md5(random()::text) where id between :id-250000 and :id+250000;
测试两个东西：
1. 数据库中存在持有事务Exclusive锁的长事务，这个事务时间段内，数据库产生的垃圾无法被回收。
postgres@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -c 1 -j 1 -T 500000
观察日志
postgres@db-172-16-3-150-> tail -f -n 1 postgresql-2015-04-29_174535.csv|grep removable
tuples: 500001 removed, 1710872 remain, 0 are dead but not yet removable
tuples: 0 removed, 478 remain, 3 are dead but not yet removable
tuples: 499647 removed, 1844149 remain, 0 are dead but not yet removable
tuples: 500001 removed, 1830118 remain, 0 are dead but not yet removable
tuples: 290450 removed, 1865527 remain, 0 are dead but not yet removable
现在看没有问题，接下来我开一个持有事务Exclusive锁的事务，
postgres=# begin;
BEGIN
postgres=# select txid_current();
txid_current
--------------
314030959
(1 row)
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
         6073
(1 row)
-[ RECORD 1 ]------+--------------
locktype          | virtualxid
database          |
relation          |
page             |
tuple             |
virtualxid       | 4/180
transactionid    |
classid          |
objid             |
objsubid          |
virtualtransaction | 4/180
pid             | 6073
mode             | ExclusiveLock
granted          | t
fastpath          | t
-[ RECORD 2 ]------+--------------
locktype          | transactionid
database          |
relation          |
page             |
tuple             |
virtualxid       |
transactionid    | 314030959
classid          |
objid             |
objsubid          |
virtualtransaction | 4/180
pid             | 6073
mode             | ExclusiveLock
granted          | t
fastpath          | f
这个事务在另一个会话中通过txid_current_snapshot可以看到它是一个未结束的事务。
postgres=# select * from txid_current_snapshot();
-[ RECORD 1 ]---------+------------------------------
txid_current_snapshot | 314030959:314030981:314030959
接下来看看日志：
不可回收的行在不断的增长。
tuples: 0 removed, 2391797 remain, 500001 are dead but not yet removable
tuples: 0 removed, 484 remain, 9 are dead but not yet removable
tuples: 0 removed, 2459288 remain, 500001 are dead but not yet removable
tuples: 0 removed, 484 remain, 9 are dead but not yet removable
tuples: 0 removed, 2713489 remain, 760235 are dead but not yet removable
tuples: 0 removed, 487 remain, 12 are dead but not yet removable
tuples: 0 removed, 2572991 remain, 760235 are dead but not yet removable
tuples: 0 removed, 487 remain, 12 are dead but not yet removable
tuples: 0 removed, 2849378 remain, 760235 are dead but not yet removable
tuples: 0 removed, 487 remain, 12 are dead but not yet removable
tuples: 0 removed, 3023757 remain, 760235 are dead but not yet removable
tuples: 0 removed, 487 remain, 12 are dead but not yet removable
tuples: 0 removed, 3135900 remain, 1137469 are dead but not yet removable
tuples: 0 removed, 490 remain, 15 are dead but not yet removable
索引和表也明显膨胀了：
postgres=# \dt+ tbl
                  List of relations
Schema | Name | Type  |  Owner |  Size  | Description
--------+------+-------+----------+--------+-------------
public | tbl  | table | postgres | 781 MB |
(1 row)

postgres=# \di+ tbl_pkey
                        List of relations
Schema | Name | Type  |  Owner | Table |  Size  | Description
--------+----------+-------+----------+-------+--------+-------------
public | tbl_pkey | index | postgres | tbl | 308 MB |
(1 row)
关闭这个事务后：这个事务期间没有被回收的垃圾可以正常回收
停之前，另外再开一个
postgres=# begin;
BEGIN
postgres=# select txid_current();
txid_current
--------------
314031042
(1 row)
然后停前面那个
postgres=# end;
COMMIT
这里主要为了说明，这样的连续事务，不会影响上一个事务过程造成影响的未回收的垃圾。
tuples: 0 removed, 481 remain, 6 are dead but not yet removable
tuples: 13629196 removed, 2515757 remain, 500001 are dead but not yet removable
tuples: 0 removed, 4845701 remain, 1000002 are dead but not yet removable
tuples: 0 removed, 7146782 remain, 1500003 are dead but not yet removable
后面启动的那个也停掉，垃圾完全回收了：
postgres=# end;
COMMIT
tuples: 7183691 removed, 11252550 remain, 0 are dead but not yet removable
tuples: 500001 removed, 6234858 remain, 0 are dead but not yet removable
但是表和索引已经膨胀了，无法收缩，除非使用rewrite table（vacuum full, cluster）或者pg_reorg,pg_repack这样的工具。
我再举一个实际可能存在的例子，例如持续的并发批量更新，也可能导致膨胀：
例如我有一个表包含100万条记录，分成10个进程，每个进程批量更新其中的10万条，并且持续不断的更新。
为什么说这样操作会引起膨胀呢，因为worker process 最小粒度是表级别的，同一张表同一时间只有一个进程在回收垃圾。这种场景会产生三个问题：
1. 瞬时产生垃圾的速度可能超过回收的速度，
2. 产生新TUPLE版本的需求超过FSM的剩余空间，
3. 回收过程中（其他进程可能会启动并发的更新，持有事务排他锁）会遇到不可回收的问题，就是前面这个例子提到的问题。
这几种原因会导致扩展新的数据块可能性变大，即膨胀。
测试例子：
postgres=# truncate tbl;
TRUNCATE TABLE
postgres=# insert into tbl select generate_series(1,1000000),md5(random()::text),clock_timestamp();
INSERT 0 1000000
postgres=# \dt+ tbl
List of relations
-[ RECORD 1 ]---------
Schema    | public
Name       | tbl
Type       | table
Owner    | postgres
Size       | 73 MB
Description |

postgres=# \di+ tbl_pkey
List of relations
-[ RECORD 1 ]---------
Schema    | public
Name       | tbl_pkey
Type       | index
Owner    | postgres
Table    | tbl
Size       | 21 MB
Description |

$ vi t1.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=1 and id<100000;
$ vi t2.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=100001 and id<200000;
$ vi t3.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=200001 and id<300000;
$ vi t4.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=300001 and id<400000;
$ vi t5.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=400001 and id<500000;
$ vi t6.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=500001 and id<600000;
$ vi t7.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=600001 and id<700000;
$ vi t8.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=700001 and id<800000;
$ vi t9.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=800001 and id<900000;
$ vi t10.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=900001 and id<=1000000;

pgbench -M prepared -n -r -f ./t1.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t2.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t3.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t4.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t5.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t6.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t7.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t8.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t9.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t10.sql -c 1 -j 1 -T 500000 &

观察到出现了不可回收的垃圾
tuples: 0 removed, 2049809 remain, 999991 are dead but not yet removable
tuples: 501373 removed, 2176172 remain, 999991 are dead but not yet removable
tuples: 1603158 removed, 2517367 remain, 899562 are dead but not yet removable
tuples: 405093 removed, 2647780 remain, 899992 are dead but not yet removable
tuples: 1100546 removed, 2724724 remain, 899562 are dead but not yet removable
tuples: 528200 removed, 2864735 remain, 1141307 are dead but not yet removable
tuples: 981628 removed, 2757909 remain, 933307 are dead but not yet removable

膨胀
postgres=# \dt+ tbl
List of relations
-[ RECORD 1 ]---------
Schema    | public
Name       | tbl
Type       | table
Owner    | postgres
Size       | 554 MB
Description |

postgres=# \di+ tbl_pkey
List of relations
-[ RECORD 1 ]---------
Schema    | public
Name       | tbl_pkey
Type       | index
Owner    | postgres
Table    | tbl
Size       | 114 MB
Description |
如果产生新TUPLE版本的需求超过FSM的剩余空间，还会继续膨胀下去。
这个问题的改进非常简单，将批量更新的粒度降低，即单个事务时间缩短，可以降低事务排他锁持有时间，减少not yet removable的情况，同时事务变小后，单个事务对剩余空间的需求量也变小了，所以不需要扩展数据块。就不会膨胀。
例子：
postgres=# create sequence seq cache 10;
CREATE SEQUENCE
postgres=# vacuum full tbl;
VACUUM

postgres@db-172-16-3-150-> vi test.sql
update tbl set info=info,crt_time=clock_timestamp() where id=mod(nextval('seq'),2000001);

为了提高速度，我这里将nextval改为immutable, 让以上SQL可以走索引扫描（生产环境请勿模仿）。
postgres=# alter function nextval(regclass) immutable;

postgres@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -c 20 -j 10 -T 500000

因为事务很短，所以只能看到少量的not yet removable，
同时因为一次更新带来的新版本需要的空间也比较小，所以不会超出FSM中的剩余空间，不需要EXTEND BLOCK。
tuples: 11116 removed, 977673 remain, 591 are dead but not yet removable
tuples: 12050 removed, 978437 remain, 979 are dead but not yet removable
tuples: 12687 removed, 981375 remain, 227 are dead but not yet removable
tuples: 12911 removed, 978912 remain, 831 are dead but not yet removable
tuples: 7 removed, 475 remain, 0 are dead but not yet removable
tuples: 13133 removed, 979761 remain, 522 are dead but not yet removable
tuples: 14419 removed, 977651 remain, 1077 are dead but not yet removable
tuples: 12111 removed, 978558 remain, 700 are dead but not yet removable

半小时后，并未出现膨胀
postgres=# \dt+ tbl
List of relations
-[ RECORD 1 ]---------
Schema    | public
Name       | tbl
Type       | table
Owner    | postgres
Size       | 75 MB
Description |

postgres=# \di+ tbl_pkey
List of relations
-[ RECORD 1 ]---------
Schema    | public
Name       | tbl_pkey
Type       | index
Owner    | postgres
Table    | tbl
Size       | 21 MB
Description |

2. autovacuum launcher process 唤醒时间太长会影响垃圾回收。
调整autovacuum_naptime = 1000， 1000秒的睡觉时间，看看怎么完蛋的吧。
pg_ctl reload
postgres=# vacuum full tbl;
VACUUM
postgres=# \dt+ tbl
List of relations
-[ RECORD 1 ]---------
Schema    | public
Name       | tbl
Type       | table
Owner    | postgres
Size       | 145 MB
Description |

postgres=# \di+ tbl_pkey
List of relations
-[ RECORD 1 ]---------
Schema    | public
Name       | tbl_pkey
Type       | index
Owner    | postgres
Table    | tbl
Size       | 43 MB
Description |

postgres@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -c 1 -j 1 -T 500000
现在不看日志，一段时间之后，你看看pg_stat_all_tables：
postgres@db-172-16-3-150-> psql
psql (9.4.1)
Type "help" for help.
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_all_tables where relname='tbl';
-[ RECORD 1 ]-------+------------------------------
relid             | 60511
schemaname       | public
relname          | tbl
seq_scan          | 6
seq_tup_read       | 12082648
idx_scan          | 116
idx_tup_fetch    | 49232974
n_tup_ins          | 2000000
n_tup_upd          | 50732969
n_tup_del          | 0
n_tup_hot_upd    | 81869
n_live_tup       | 2328301
n_dead_tup       | 7895450
n_mod_since_analyze | 7825343
last_vacuum       |
last_autovacuum    | 2015-04-29 18:02:45.325102+08
last_analyze       |
last_autoanalyze | 2015-04-29 18:02:45.596096+08
vacuum_count       | 0
autovacuum_count | 247
analyze_count    | 0
autoanalyze_count | 91
表已经膨胀了：
postgres=# \dt+ tbl
List of relations
-[ RECORD 1 ]---------
Schema    | public
Name       | tbl
Type       | table
Owner    | postgres
Size       | 393 MB
Description |
postgres=# \di+ tbl_pkey
List of relations
-[ RECORD 1 ]---------
Schema    | public
Name       | tbl_pkey
Type       | index
Owner    | postgres
Table    | tbl
Size       | 115 MB
Description |

[小结]
通过上面的分析，我们应该如何减少或避免PostgreSQL数据膨胀呢？
1. 一定要开启autovacuum。
2. 提高系统的IO能力，越高越好。
3. 调整触发阈值，让触发阈值和记录数匹配。调小autovacuum_vacuum_scale_factor和autovacuum_analyze_scale_factor。比如我想在有1万条垃圾记录后就触发垃圾回收，那么对于一个1000万的表来说，我应该把autovacuum_vacuum_scale_factor调到千分之一即0.001，而autovacuum_analyze_scale_factor应该调到0.0005。
4. 增加autovacuum_max_workers，同时增加autovacuum_work_mem，同时增加系统内存。
例如对于有大量表需要频繁更新的数据库集群，可以将autovacuum_max_workers调整为与CPU核数一致，并将autovacuum_work_mem调整为2GB，同时需要确保系统预留的内存大于autovacuum_max_workers*autovacuum_work_mem。
5. 应用程序设计时，尽量避免持有事务Exclusive锁的长事务（DDL,DML都会持有事务Exclusive锁）
6. 对于IO没有问题的系统，关闭autovacuum_vacuum_cost_delay。
7. 调整autovacuum_naptime参数到最低，如果还是唤醒时间太长，可以调整代码中的限制，例如改为1毫秒：
#define MIN_AUTOVAC_SLEEPTIME 1.0             /* milliseconds */
8. 应用程序设计时，避免使用大批量的更新，删除操作，可以切分为多个事务进行。
9. 使用大的数据块，对于现代的硬件水平，32KB是比较好的选择，fillfactor实际上不需要太关注，100就可以了，调低它其实没有必要，因为数据库总是有垃圾，也就是说每个块在被更新后实际上都不可能是满的。
10. 万一真的膨胀了，可以通过table rewrite来回收(如vacuum full, cluster)，但是需要迟排他锁。建议使用pg_reorg或者pg_repack来回收，实际上用到了交换 filenode可以缩短需要持有排他锁的时间。

[参考]
1. src/backend/postmaster/autovacuum.c

wanzai01 · 发表于 2015-6-12 12:45

阿里云RDS for PostgreSQL试用报告 - 3
上一篇测试中用了16个最低配置的RDS实例，测试中发现这16个实例实际上是分布在8台主机上（通过 inet_server_addr() IP区分）。测试使用plproxy做数据水平拆分，在16个实例中创建了32个DB，将6400万数据拆分到32个库。
http://blog.163.com/digoal@126/b ... 770402015599230431/
http://blog.163.com/digoal@126/b ... 704020155104492068/

接下来要做的是单个RDS实例（同样的配置）承载6400万数据的测试。
创建测试表，生成测试数据。
create table userinfo(userid int,info text);
create table session (userid int,last_login timestamp);
create table login_log (userid int,db_user name,client_addr inet,
                     client_port int,server_addr inet,server_port int,login_time timestamp);
create table tbl_small (userid int primary key,info text);

set synchronous_commit=off;
insert into userinfo select generate_series(1,32000000);
insert into session select generate_series(1,32000000);
insert into tbl_small select generate_series(1,500000);
set maintenance_work_mem='10GB';
alter table userinfo add constraint pk_userinfo primary key (userid);
alter table session add constraint pk_session primary key (userid);

postgres=> \dt+
                     List of relations
Schema |    Name    | Type  | Owner  |  Size | Description
--------+-----------------+-------+--------+---------+-------------
public | ha_health_check | table | aurora | 40 kB |
public | session       | table | digoal | 1106 MB |
public | userinfo       | table | digoal | 1106 MB |
(3 rows)
postgres=> \di+
                                 List of relations
Schema |       Name       | Type  | Owner  |    Table    |  Size  | Description
--------+----------------------+-------+--------+-----------------+--------+-------------
public | ha_health_check_pkey | index | aurora | ha_health_check | 16 kB  |
public | pk_session          | index | digoal | session       | 686 MB |
public | pk_userinfo       | index | digoal | userinfo       | 686 MB |
(3 rows)

测试中发现一个小小的问题，阿里云RDS限制了数据库进程的内存使用（包括shared buffers,work_mem,maintenance_work_mem, wal_buffers等限制），但是并不会限制OS层缓存的使用，也就是说我们的数据表对应的数据文件如果是热数据的话，可能被缓存好OS层缓存中，假如RDS能提供pgfincore插件就更完美了，不过在云环境中使用会造成内存争抢的情况。
下面我们看一个测试，实例只有256MB的shared buffer, 下面的查询却飞快。
postgres=> explain (analyze,verbose,timing,buffers,costs) select count(userid) from session;
                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=541593.00..541593.01 rows=1 width=4) (actual time=6574.761..6574.761 rows=1 loops=1)
Output: count(userid)
Buffers: shared hit=20229 read=121364
I/O Timings: read=227.803
->  Seq Scan on public.session  (cost=0.00..461593.00 rows=32000000 width=4) (actual time=0.029..3295.744 rows=32000001 loops=1)
      Output: userid, last_login
      Buffers: shared hit=20229 read=121364
      I/O Timings: read=227.803
Planning time: 0.044 ms
Execution time: 6574.794 ms
(10 rows)
postgres=> explain (analyze,verbose,timing,buffers,costs) select count(userid) from userinfo;
                                                         QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=541593.00..541593.01 rows=1 width=4) (actual time=6653.383..6653.383 rows=1 loops=1)
Output: count(userid)
Buffers: shared hit=64 read=141529
I/O Timings: read=265.700
->  Seq Scan on public.userinfo  (cost=0.00..461593.00 rows=32000000 width=4) (actual time=0.029..3358.069 rows=32000001 loops=1)
      Output: userid, info
      Buffers: shared hit=64 read=141529
      I/O Timings: read=265.700
Planning time: 0.046 ms
Execution time: 6653.417 ms
(10 rows)

分析这里的I/O Timings，单位毫秒，每次IO请求只需要0.0019毫秒。
这已经是内存级别的速度了。
postgres=> select 265.700/141529;
      ?column?
------------------------
0.00187735375788707615
(1 row)
postgres=> select 227.803/121364;
      ?column?
------------------------
0.00187702284038100260
(1 row)
离散扫描测试：
postgres=> set enable_seqscan=off;
SET
postgres=> explain (analyze,verbose,timing,buffers,costs) select count(userid) from userinfo;
                                                                        QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
--------------------------
Aggregate  (cost=1052572.56..1052572.57 rows=1 width=4) (actual time=10343.801..10343.801 rows=1 loops=1)
Output: count(userid)
Buffers: shared read=229028
I/O Timings: read=674.634
->  Index Only Scan using pk_userinfo on public.userinfo  (cost=0.56..972572.56 rows=32000000 width=4) (actual time=0.082..7277.8
18 rows=32000001 loops=1)
      Output: userid
      Heap Fetches: 32000001
      Buffers: shared read=229028
      I/O Timings: read=674.634
Planning time: 0.035 ms
Execution time: 10343.851 ms
(11 rows)
postgres=> explain (analyze,verbose,timing,buffers,costs) select count(userid) from session;
                                                                     QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
------------------------
Aggregate  (cost=1052572.56..1052572.57 rows=1 width=4) (actual time=10321.901..10321.901 rows=1 loops=1)
Output: count(userid)
Buffers: shared read=229028
I/O Timings: read=633.969
->  Index Only Scan using pk_session on public.session  (cost=0.56..972572.56 rows=32000000 width=4) (actual time=0.080..7268.908
rows=32000001 loops=1)
      Output: userid
      Heap Fetches: 32000001
      Buffers: shared read=229028
      I/O Timings: read=633.969
Planning time: 0.056 ms
Execution time: 10321.935 ms
(11 rows)
分析这里的I/O Timings，单位毫秒，每次IO请求只需要0.0028毫秒。
postgres=> select 633.969/229028;
      ?column?
------------------------
0.00276808512496288663
(1 row)
postgres=> select 674.634/229028;
      ?column?
------------------------
0.00294563983443072463
(1 row)
如果这些数据不是在内存中，那么有这样IOPS能力的块设备，那也是怪兽级别的了(8K的数据块，离散读IOPS达到36万，未考虑read ahead，考虑的话一般默认预读是256个扇区，真实IOPS能力会略低)。
我个人的判断还是倾向阿里的RDS未限制OS层CACHE，也就是随你用。

创建测试函数：
CREATE OR REPLACE FUNCTION query_pk(IN i_userid int, OUT userid int, OUT info text)
   RETURNS record
   LANGUAGE plpgsql
   STRICT
AS $function$
   declare
   begin
      select t.userid,t.info into userid,info from userinfo t where t.userid=i_userid;
      return;
   end;
$function$;

CREATE OR REPLACE FUNCTION insert_log(IN i_userid int)
   RETURNS void
   LANGUAGE plpgsql
   STRICT
AS $function$
   declare
   begin
      set synchronous_commit=off;
      insert into login_log (userid,db_user,client_addr,client_port,server_addr,server_port,login_time)
      values (i_userid,current_user,inet_client_addr(),inet_client_port(),inet_server_addr(),inet_server_port(),now());
   end;
$function$;

CREATE OR REPLACE FUNCTION query_insert(IN i_userid int, OUT userid int, OUT info text)
   RETURNS record
   LANGUAGE plpgsql
   STRICT
AS $function$
   declare
   begin
      set synchronous_commit=off;
      select t.userid,t.info into userid,info from userinfo t where t.userid=i_userid;
      insert into login_log (userid,db_user,client_addr,client_port,server_addr,server_port,login_time)
      values (i_userid,current_user,inet_client_addr(),inet_client_port(),inet_server_addr(),inet_server_port(),now());
      return;
   end;
$function$;

CREATE OR REPLACE FUNCTION update_pk(IN i_userid int)
   RETURNS void
   LANGUAGE plpgsql
   STRICT
AS $function$
   declare
   begin
      set synchronous_commit=off;
      update session t set last_login=now() where t.userid=i_userid;
   end;
$function$;

CREATE OR REPLACE FUNCTION query_update_insert(IN i_userid int, OUT userid int, OUT info text)
   RETURNS record
   LANGUAGE plpgsql
   STRICT
AS $function$
   declare
   begin
      set synchronous_commit=off;
      select t.userid,t.info into userid,info from userinfo t where t.userid=i_userid;
      insert into login_log (userid,db_user,client_addr,client_port,server_addr,server_port,login_time)
      values (i_userid,current_user,inet_client_addr(),inet_client_port(),inet_server_addr(),inet_server_port(),now());
      update session t set last_login=now() where t.userid=i_userid;
      return;
   end;
$function$;

CREATE OR REPLACE FUNCTION query_smalltbl(IN i_userid int, OUT userid int, OUT info text)
   RETURNS record
   LANGUAGE plpgsql
   STRICT
AS $function$
   declare
   begin
      select t.userid,t.info into userid,info from tbl_small t where t.userid=i_userid;
      return;
   end;
$function$;

测试结果：
vi test.sql
\setrandom id 1 32000000
select query_pk(:id);
pgbench -M prepared -n -r -f ./test.sql -P 1 -c 88 -j 88 -T 20 -h xxxx.pg.rds.aliyuncs.com -p 3433 -U digoal postgres
progress: 1.1 s, 1938.5 tps, lat 30.861 ms stddev 63.730
progress: 2.1 s, 3397.7 tps, lat 26.197 ms stddev 43.067
progress: 3.0 s, 3293.2 tps, lat 25.744 ms stddev 36.761
progress: 4.2 s, 3477.7 tps, lat 26.012 ms stddev 44.032
progress: 5.1 s, 3448.3 tps, lat 25.291 ms stddev 39.993
progress: 6.0 s, 3581.1 tps, lat 24.386 ms stddev 53.515
progress: 7.0 s, 3669.4 tps, lat 23.736 ms stddev 43.620
progress: 8.1 s, 3635.0 tps, lat 24.333 ms stddev 54.772
progress: 9.0 s, 3625.6 tps, lat 24.457 ms stddev 39.071
progress: 10.0 s, 3708.4 tps, lat 23.017 ms stddev 41.434

vi test.sql
\setrandom id 1 32000000
select insert_log(:id);
pgbench -M prepared -n -r -f ./test.sql -P 1 -c 88 -j 88 -T 20 -h xxxx.pg.rds.aliyuncs.com -p 3433 -U digoal postgres
progress: 1.1 s, 2194.8 tps, lat 26.288 ms stddev 51.427
progress: 2.0 s, 3841.0 tps, lat 22.859 ms stddev 37.456
progress: 3.0 s, 3745.8 tps, lat 23.536 ms stddev 46.164
progress: 4.0 s, 3843.2 tps, lat 22.481 ms stddev 37.077
progress: 5.0 s, 3676.9 tps, lat 24.256 ms stddev 45.177
progress: 6.1 s, 3838.0 tps, lat 22.898 ms stddev 38.825
progress: 7.0 s, 3890.9 tps, lat 22.836 ms stddev 38.612
progress: 8.0 s, 3590.9 tps, lat 24.565 ms stddev 43.551
progress: 9.0 s, 3675.0 tps, lat 24.210 ms stddev 38.266
progress: 10.1 s, 3812.7 tps, lat 22.507 ms stddev 36.516

vi test.sql
\setrandom id 1 32000000
select query_insert(:id);
pgbench -M prepared -n -r -f ./test.sql -P 1 -c 88 -j 88 -T 20 -h xxxx.pg.rds.aliyuncs.com -p 3433 -U digoal postgres
progress: 1.1 s, 1269.2 tps, lat 45.745 ms stddev 89.929
progress: 2.1 s, 2700.4 tps, lat 33.356 ms stddev 58.091
progress: 3.0 s, 2654.6 tps, lat 35.314 ms stddev 54.011
progress: 4.0 s, 2673.0 tps, lat 31.859 ms stddev 48.704
progress: 5.0 s, 2762.7 tps, lat 31.759 ms stddev 51.929
progress: 6.1 s, 2667.7 tps, lat 32.047 ms stddev 55.966
progress: 7.1 s, 2688.7 tps, lat 32.407 ms stddev 58.218
progress: 8.2 s, 2785.4 tps, lat 30.795 ms stddev 65.419
progress: 9.0 s, 2789.9 tps, lat 35.547 ms stddev 58.010
progress: 10.0 s, 2879.6 tps, lat 30.196 ms stddev 53.233

vi test.sql
\setrandom id 1 32000000
select update_pk(:id);
pgbench -M prepared -n -r -f ./test.sql -P 1 -c 88 -j 88 -T 20 -h xxxx.pg.rds.aliyuncs.com -p 3433 -U digoal postgres
progress: 2.5 s, 282.4 tps, lat 218.387 ms stddev 495.226
progress: 5.8 s, 94.8 tps, lat 787.358 ms stddev 1325.987
progress: 5.8 s, 15727.4 tps, lat 150.434 ms stddev 668.515
progress: 5.9 s, 945.4 tps, lat 769.080 ms stddev 1374.084
progress: 16.1 s, 93.2 tps, lat 833.108 ms stddev 1856.263
progress: 16.2 s, 2598.5 tps, lat 665.837 ms stddev 1693.883
progress: 17.2 s, 71.7 tps, lat 1571.432 ms stddev 1858.991
progress: 22.2 s, 29.9 tps, lat 3003.451 ms stddev 2389.133

vi test.sql
\setrandom id 1 32000000
select query_update_insert(:id);
pgbench -M prepared -n -r -f ./test.sql -P 1 -c 88 -j 88 -T 20 -h xxxx.pg.rds.aliyuncs.com -p 3433 -U digoal postgres
progress: 5.7 s, 144.2 tps, lat 563.075 ms stddev 1426.395
progress: 5.8 s, 1292.3 tps, lat 133.407 ms stddev 609.956
progress: 5.8 s, 1028.1 tps, lat 29.967 ms stddev 37.131
progress: 11.3 s, 25.5 tps, lat 2265.784 ms stddev 2573.469
progress: 11.3 s, 6079.0 tps, lat 9.619 ms stddev 9.293
progress: 11.3 s, 4787.2 tps, lat 624.805 ms stddev 1740.448
progress: 16.9 s, 98.1 tps, lat 867.968 ms stddev 1989.390
progress: 17.1 s, 1313.4 tps, lat 870.720 ms stddev 2098.172
progress: 17.1 s, 13863.8 tps, lat 65.169 ms stddev 56.996
progress: 17.1 s, 11670.3 tps, lat 20.520 ms stddev 35.188

postgres=> \timing
Timing is on.
postgres=> select count(*) from login_log;
  count
--------
140456
(1 row)
Time: 28.747 ms
postgres=> select count(*) from userinfo;
  count
----------
32000001
(1 row)
Time: 3141.289 ms

vi test.sql
\setrandom id 1 32000000
select query_smalltbl(:id);
pgbench -M prepared -n -r -f ./test.sql -P 1 -c 88 -j 88 -T 20 -h xxxx.pg.rds.aliyuncs.com -p 3433 -U digoal postgres
progress: 1.0 s, 2420.4 tps, lat 23.557 ms stddev 45.623
progress: 2.0 s, 4337.3 tps, lat 19.923 ms stddev 37.168
progress: 3.0 s, 4555.2 tps, lat 20.154 ms stddev 35.738
progress: 4.0 s, 4362.4 tps, lat 20.094 ms stddev 40.591
progress: 5.1 s, 4203.5 tps, lat 20.386 ms stddev 36.220
progress: 6.0 s, 4484.5 tps, lat 19.888 ms stddev 36.724
progress: 7.0 s, 4551.6 tps, lat 19.634 ms stddev 39.959
progress: 8.0 s, 4041.8 tps, lat 21.195 ms stddev 40.362
progress: 9.1 s, 4557.6 tps, lat 19.758 ms stddev 37.218
progress: 10.0 s, 4349.1 tps, lat 20.254 ms stddev 34.562

测试结果与使用plproxy分布式处理的对比：
性能提升非常明显。

再报几个阿里云RDS for PostgreSQL的问题：
1. 当容量超出时（例如执行一个大的插入，我在生成测试数据时遇到），数据库会被KILL掉，数据库重启并恢复。
恢复时间有点长，约30分钟，(恢复过程中建议不要限制IOPS，尽快恢复才是王道)并且恢复后，还会有很长一段时间处于recovery状态，不清楚这个时间是否是切换到standby了，难道需要后台人工处理？未做自动化吗？
postgres=> select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)
同时这点过于暴力，是不是可以给用户提个醒，和用户协商一下呢？给用户一个时间窗口，让用户自己处理。

2. 容量计算方法有点不合常理，容量包括了数据，XLOG，日志，。。。而实际上对用户来说应该只关注数据，至于xlog不应该让用户买单，因为xlog是用于保障可靠性和数据恢复的。归档后就可以复用，同时最低配置的RDS配置了
wal_keep_segments=80
checkpoint_segments=64
那么xlog最多会有64*2+1=129个xlog文件，即129*16MB = 2064MB。
什么情况下会有这么多呢？当checkpoint周期拉很长时，例如配置了 checkpoint_completion_target = 0.9，这个周期就非常长，接近整个检查点的周期。
用户可通过我之前写的文章了解这里的含义：
http://blog.163.com/digoal@126/b ... 040201542103933969/
http://blog.163.com/digoal@126/b ... 770402015463252387/
http://blog.163.com/digoal@126/b ... 704020154651655783/
http://blog.163.com/digoal@126/b ... 704020154653422892/
http://blog.163.com/digoal@126/b ... 704020154811421484/
http://blog.163.com/digoal@126/b ... 704020154129958753/

实际上RDS的可用空间在最紧张时就变成了5GB-2064MB = 3056MB。
当然，空下来之后，xlog会回收掉一些，差不多一半，所以闲下来后还可以加1G左右的可用空间，即4GB左右，这是你能使用的最大空间了。
对于更高配置的RDS，wal_keep_segments和checkpoint_segments配置会更大，这种计算空间的方法，对用户来说浪费空间更多。
建议阿里云不计XLOG的空间，只计它的IOPS。

3. OS层缓存，这个已经说了，阿里云RDS目前可能没有限制OS层缓存，所以尽情享受吧。

4. IOPS限制间隔，在测试UPDATE时，性能非常不稳，可能是IOPS限制间隔或者手段造成的，当然也可能是FPW造成的，因为没有RDS所在服务器的权限，没有办法调试，所以基本靠猜。

5. 为什么我们的plproxy没有看到所有测试的线性性能提升(目测某些只有5到8倍的性能提升，某些有超过16倍的性能提升)，因为阿里云RDS并没有限制CPU的使用率，只限制了共享内存和IOPS，那么有些节点所在的机器可能CPU资源较空，有些较忙，我们的测试虽然是随机的分发到各个节点，但是因测试线程是共享的，所以单个实例如果比较慢，对总体测试结果有一定的影响。
而对于IO类的测试，性能提升是达到16倍的。
排除这些影响，使用plproxy应该是线性提升的，我在以前的分享中有数据可供查看，有兴趣的朋友可以看我以前的一些分享。

先写到这里，下一篇来讲讲如何增加或减少数据节点。

[参考]
1. http://blog.163.com/digoal@126/b ... 770402015599230431/
2. http://blog.163.com/digoal@126/b ... 704020155104492068/

wanzai01 · 发表于 2015-6-12 12:46

阿里云RDS for PostgreSQL试用报告 - 4
我前面用16个RDS，创建了32个数据库，每个RDS上有2个库，也就是说，我最多可以扩展到32个RDS。
下面将演示一下扩展(或者称为数据分片节点迁移)的过程。
实际上RDS现在还欠缺一个功能，就是数据库克隆，你可以这么理解，给现有的数据库创建STANDBY，然后将这个STANDBY激活，就完成了对数据库的克隆。
为什么我们需要数据库克隆功能呢？
这会使得数据库的扩容变得非常简单，比如我们这里的应用场景，如果要将16个RDS，变成32个RDS，那么克隆无疑是最好的办法。因为不需要做逻辑数据迁移的事情，只需要删除不需要的数据库，以及调整plproxy的cluster配置即可。
我们先假设RDS有创建STANDBYD的功能（相信未来会增加），看看如何来实现RDS的扩容。

假设主RDS1包含db0,db16两个库，现在要拆分为两个RDS，RDS1(db0), RDS2(db16)，分别包含db0和db16。
1. 为需要拆分的主RDS创建standby RDS, 确认流复制正常，确认standby RDS到达catchup状态。

2. 配置密码文件~/.pgpass，包含克隆库的认证信息。

3. 调整plproxy cluster配置。使用只读事务连接需要迁移的数据库，避免写操作带来的主备不一致。
例如 :
alter server rds_pg_cluster options (set p16 'host=old_rds_hostname dbname=db16 port=3433 user=digoal keepalives_idle=30 keepalives_interval=10 keepalives_count=10 application_name=16 options=''-c default_transaction_read_only=true'' ');

4. 确认主RDS需迁移的库(db16)没有连接，确认standby处于catchup状态。

5. 激活standby。

6. 调整plproxy cluster配置。原连接RDS1(db16)，修改为RDS2(db16)。
例如 :
alter server rds_pg_cluster options (set p16 'host=new_rds_hostname dbname=db16 port=3433 user=digoal keepalives_idle=30 keepalives_interval=10 keepalives_count=10 application_name=16');

7. 删除主RDS节点已迁移的数据库(db16)，删除standby节点多余的数据库(db0)。

循环1-7，将所有RDS拆分。

回到现实，现实是目前RDS没有提供克隆功能。那么我们需要人工实现数据迁移，需迁移的东西还挺多，包括表，视图，函数，。。。。。可以用pg_dump，但是怎么实现增量呢？总不能停很久吧。
对于有主键或唯一约束的表，可以用物化视图来做增量迁移。然后修改数据字典(需要超级用户，阿里云RDS目前提供给用户的是普通用户)（有风险,请知晓）。
http://blog.163.com/digoal@126/b ... 040201559105235803/

下面是人力扩容的例子：
源：
postgres=> select datname,pg_database_size(oid)/1024/1024 from pg_database;
  datname  | ?column?
-----------+----------
template1 |       6
template0 |       6
postgres  |    3618
digoal |       6
db7    |    179
db23    |    179
(6 rows)

目标，把db23迁移到以下RDS：
postgres=> select datname,pg_database_size(oid)/1024/1024 from pg_database;
  datname  | ?column?
-----------+----------
template1 |       6
template0 |       6
postgres  |       6
digoal |       6
db8    |    179
db24    |    179
(6 rows)

在目标RDS创建db23数据库：
postgres=> create database db23;
CREATE DATABASE
postgres=> \c db23
psql (9.4.3, server 9.4.1)
You are now connected to database "db23" as user "digoal".
创建schema和需要迁移的函数：
db23=> create schema digoal;
CREATE SCHEMA
db23=> CREATE OR REPLACE FUNCTION digoal.dy(sql text)
db23->    RETURNS SETOF record
db23->    LANGUAGE plpgsql
db23->    STRICT
db23->    AS $function$
db23$>    declare
db23$>       rec record;
db23$>    begin
db23$>       for rec in execute sql loop
db23$>          return next rec;
db23$>       end loop;
db23$>       return;
db23$>    end;
db23$>    $function$;
CREATE FUNCTION
db23=> CREATE OR REPLACE FUNCTION digoal.dy_ddl(sql text)
db23->    RETURNS VOID
db23->    LANGUAGE plpgsql
db23->    STRICT
db23->    AS $function$
db23$>    declare
db23$>    begin
db23$>       execute sql;
db23$>       return;
db23$>    exception when others then return;
db23$>    end;
db23$>    $function$;
CREATE FUNCTION
准备需要迁移的数据的外部表：
db23=> create extension postgres_fdw;
CREATE EXTENSION

CREATE SERVER db23
      FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'old.pg.rds.aliyuncs.com', port '3433', dbname 'db23');

CREATE USER MAPPING FOR digoal  --  locale user
      SERVER db23
      OPTIONS (user 'digoal', password 'digoal');  --  remote user/password

CREATE FOREIGN TABLE digoal.ft_userinfo (
      dbid int default 23,
      userid int,
info text
)
      SERVER db23
      OPTIONS (schema_name 'digoal', table_name 'userinfo');

CREATE FOREIGN TABLE digoal.ft_session (
      dbid int default 23,
      userid int,
last_login timestamp without time zone
)
      SERVER db23
      OPTIONS (schema_name 'digoal', table_name 'session');

CREATE FOREIGN TABLE digoal.ft_tbl_small (
      userid int,
info text
)
      SERVER db23
      OPTIONS (schema_name 'digoal', table_name 'tbl_small');

CREATE FOREIGN TABLE digoal.ft_login_log (
      dbid int default 23,
      userid int,
      db_user name,
      client_addr inet,
      client_port int,
      server_addr inet,
      server_port int,
      login_time timestamp without time zone
)
      SERVER db23
      OPTIONS (schema_name 'digoal', table_name 'login_log');

创建物化视图，日志表（日志数据不迁移）
CREATE MATERIALIZED VIEW digoal.userinfo (
      dbid ,
      userid ,
info
) as select * from digoal.ft_userinfo;
set maintenance_work_mem='10GB';
create unique index pk_userinfo on digoal.userinfo (userid);

CREATE MATERIALIZED VIEW digoal.session (
      dbid ,
      userid ,
last_login
) as select * from digoal.ft_session;
set maintenance_work_mem='10GB';
create unique index pk_session on digoal.session (userid);

CREATE MATERIALIZED VIEW digoal.tbl_small (
      userid ,
info
) as select * from digoal.ft_tbl_small;
set maintenance_work_mem='10GB';
create unique index pk_tbl_small on digoal.tbl_small (userid);

CREATE TABLE digoal.login_log (
      dbid int default 23,
      userid int,
      db_user name,
      client_addr inet,
      client_port int,
      server_addr inet,
      server_port int,
      login_time timestamp without time zone
);

创建需要迁移的函数：
CREATE OR REPLACE FUNCTION digoal.dy_generate_test_ddl()
   RETURNS VOID
   LANGUAGE plpgsql
   STRICT
AS $function$
   declare
      node int;
sql text;
   begin
      select application_name::int into node from pg_stat_activity where pid=pg_backend_pid();
sql := $a$insert into digoal.userinfo select $a$||node||$a$,generate_series($a$||node||$a$,32000000,32)$a$;
execute sql;
sql := $a$insert into digoal.session select dbid,userid from digoal.userinfo$a$;
execute sql;
      return;
   exception when others then return;
   end;
$function$;

CREATE OR REPLACE FUNCTION digoal.query_pk(IN i_userid int, OUT dbid int, OUT userid int, OUT info text)
   RETURNS record
   LANGUAGE plpgsql
   STRICT
AS $function$
   declare
   begin
      select t.dbid,t.userid,t.info into dbid,userid,info from digoal.userinfo t where t.userid=i_userid;
      return;
   end;
$function$;

CREATE OR REPLACE FUNCTION digoal.insert_log(IN i_userid int)
   RETURNS void
   LANGUAGE plpgsql
   STRICT
AS $function$
   declare
   begin
      insert into digoal.login_log (userid,db_user,client_addr,client_port,server_addr,server_port,login_time)
      values (i_userid,current_user,inet_client_addr(),inet_client_port(),inet_server_addr(),inet_server_port(),now());
   end;
$function$;

CREATE OR REPLACE FUNCTION digoal.query_insert(IN i_userid int, OUT dbid int, OUT userid int, OUT info text)
   RETURNS record
   LANGUAGE plpgsql
   STRICT
AS $function$
   declare
   begin
      select t.dbid,t.userid,t.info into dbid,userid,info from digoal.userinfo t where t.userid=i_userid;
      insert into digoal.login_log (userid,db_user,client_addr,client_port,server_addr,server_port,login_time)
      values (i_userid,current_user,inet_client_addr(),inet_client_port(),inet_server_addr(),inet_server_port(),now());
      return;
   end;
$function$;

CREATE OR REPLACE FUNCTION digoal.update_pk(IN i_userid int)
   RETURNS void
   LANGUAGE plpgsql
   STRICT
AS $function$
   declare
   begin
      update digoal.session t set last_login=now() where t.userid=i_userid;
   end;
$function$;

CREATE OR REPLACE FUNCTION digoal.query_update_insert(IN i_userid int, OUT dbid int, OUT userid int, OUT info text)
   RETURNS record
   LANGUAGE plpgsql
   STRICT
AS $function$
   declare
   begin
      select t.dbid,t.userid,t.info into dbid,userid,info from digoal.userinfo t where t.userid=i_userid;
      insert into digoal.login_log (userid,db_user,client_addr,client_port,server_addr,server_port,login_time)
      values (i_userid,current_user,inet_client_addr(),inet_client_port(),inet_server_addr(),inet_server_port(),now());
      update digoal.session t set last_login=now() where t.userid=i_userid;
      return;
   end;
$function$;

CREATE OR REPLACE FUNCTION digoal.query_smalltbl(IN i_userid int, OUT userid int, OUT info text)
   RETURNS record
   LANGUAGE plpgsql
   STRICT
AS $function$
   declare
   begin
      select t.userid,t.info into userid,info from digoal.tbl_small t where t.userid=i_userid;
      return;
   end;
$function$;

PL/Proxy节点操作如下：
配置.pgpass, 新增：
new.pg.rds.aliyuncs.com:3433:*:digoal:digoal
使用default_transaction_read_only默认读事务，屏蔽写操作.
在迁移时间段内，用户可以正常执行读请求，但是如果执行写请求会失败，这样确保数据迁移的一致性，同时降低迁移过程对业务的影响。
postgres# alter server rds_pg_cluster options (set p23 'host=old.pg.rds.aliyuncs.com dbname=db23 port=3433 user=digoal keepalives_idle=30 keepalives_interval=10 keepalives_count=10 application_name=23 options=''-c default_transaction_read_only=true'''); -- 注意里面是两个单引号
测试读正常，写失败：
postgres=# select query_pk(23);
query_pk
----------
(23,23,)
(1 row)

postgres=# select insert_log(23);
ERROR:  public.insert_log(1): [db23] REMOTE ERROR: cannot execute INSERT in a read-only transaction
CONTEXT:  Remote context: SQL statement "insert into digoal.login_log (userid,db_user,client_addr,client_port,server_addr,server_port,login_time)
values (i_userid,current_user,inet_client_addr(),inet_client_port(),inet_server_addr(),inet_server_port(),now())"
PL/pgSQL function insert_log(integer) line 4 at SQL statement

new RDS执行操作如下，刷新物化视图:
refresh materialized view CONCURRENTLY digoal.userinfo;
REFRESH MATERIALIZED VIEW
Time: 10953.220 ms
refresh materialized view CONCURRENTLY digoal.session;
REFRESH MATERIALIZED VIEW
Time: 11013.860 ms
refresh materialized view CONCURRENTLY digoal.tbl_small;
REFRESH MATERIALIZED VIEW
Time: 5084.118 ms

这里卡住，因为要修改数据字典需要超级用户。而RDS提供的用户是普通用户，无法修改数据字典（虽然有风险，这里只为演示）。所以这样迁移行不通。
如果是超级用户，那么操作请参考我前期写的BLOG
http://blog.163.com/digoal@126/b ... 040201559105235803/

为了演示下去，我只能选择全量迁移。（其他增量方法也有，本文不演示）
db23=> drop materialized view digoal.session ;
DROP MATERIALIZED VIEW
Time: 16.528 ms
db23=> drop materialized view digoal.userinfo;
DROP MATERIALIZED VIEW
Time: 15.781 ms
db23=> drop materialized view digoal.tbl_small;
DROP MATERIALIZED VIEW
Time: 9.458 ms

为了提高迁移速度，用了一些手段。
set synchronous_commit=off;
set maintenance_work_mem='10GB';
CREATE TABLE digoal.userinfo (
      dbid int default 23,
      userid int,
info text
) with (autovacuum_enabled=off, toast.autovacuum_enabled=off);

CREATE TABLE digoal.session (
      dbid int default 23,
      userid int,
last_login timestamp without time zone
) with (autovacuum_enabled=off, toast.autovacuum_enabled=off);

CREATE TABLE digoal.tbl_small (
      userid int,
info text
) with (autovacuum_enabled=off, toast.autovacuum_enabled=off);

insert into digoal.userinfo select * from digoal.ft_userinfo;
INSERT 0 1000000
Time: 45290.701 ms
insert into digoal.session select * from digoal.ft_session;
INSERT 0 1000000
Time: 42212.278 ms
insert into digoal.tbl_small select * from digoal.ft_tbl_small;
INSERT 0 500000
Time: 22885.456 ms

alter table digoal.userinfo add constraint pk_userinfo primary key (userid);
ALTER TABLE
Time: 16962.174 ms
alter table digoal.session add constraint pk_session primary key (userid);
ALTER TABLE
Time: 20809.422 ms
alter table digoal.tbl_small add constraint pk_tbl_small primary key (userid);
ALTER TABLE
Time: 17484.201 ms

vacuum analyze digoal.userinfo;
Time: 65.790 ms
vacuum analyze digoal.session;
Time: 65.427 ms
vacuum analyze digoal.tbl_small;
Time: 45.453 ms

alter table digoal.userinfo set (autovacuum_enabled=on, toast.autovacuum_enabled=on);
alter table digoal.session set (autovacuum_enabled=on, toast.autovacuum_enabled=on);
alter table digoal.tbl_small set (autovacuum_enabled=on, toast.autovacuum_enabled=on);

PL/Proxy,  修改集群，db23的目标主机为新的RDS，并且开放读写权限：
postgres# alter server rds_pg_cluster options (set p23 'host=old.pg.rds.aliyuncs.com dbname=db23 port=3433 user=digoal keepalives_idle=30 keepalives_interval=10 keepalives_count=10 application_name=23');

删除老RDS上的db23.
psql -h old.pg.rds.aliyuncs.com -p 3433 -U digoal postgres
postgres=> drop database db23;
DROP DATABASE

测试plproxy分发是否正常分发到新的数据库：
postgres=# select * from query_pk(23);
dbid | userid | info
------+--------+------
23 |    23 |
(1 row)

vi test.sql
\setrandom id 1 32000000
select insert_log(:id);
pgbench -M prepared -n -r -f ./test.sql -P 1 -c 30 -j 30 -T 30
progress: 1.0 s, 7853.2 tps, lat 3.340 ms stddev 6.056
progress: 2.0 s, 10766.4 tps, lat 2.432 ms stddev 5.433
progress: 3.0 s, 11395.0 tps, lat 2.277 ms stddev 4.590
progress: 4.0 s, 11622.1 tps, lat 2.216 ms stddev 4.493
progress: 5.0 s, 10519.9 tps, lat 2.454 ms stddev 5.600
progress: 6.0 s, 11153.4 tps, lat 2.317 ms stddev 4.795
progress: 7.0 s, 11474.3 tps, lat 2.312 ms stddev 4.802
progress: 8.0 s, 11398.5 tps, lat 2.253 ms stddev 4.308
progress: 9.0 s, 12106.7 tps, lat 2.174 ms stddev 3.302
progress: 10.0 s, 12567.8 tps, lat 2.068 ms stddev 2.075

连接到新的db23：
psql -h new_rds -p 3433 -U digoal
postgres=> \c db23
psql (9.4.3, server 9.4.1)
You are now connected to database "db23" as user "digoal".
db23=> select count(*) from login_log ;
count
-------
10547
(1 row)

[参考]
1. http://blog.163.com/digoal@126/b ... 040201559105235803/
2. http://www.postgresql.org/docs/9.4/static/libpq-connect.html