Thread starter: Yem凌

[Win an iPad + ¥1,000] How to better combine PostgreSQL's "Internet+" features with your business | essay contest

11#
Posted on 2015-06-04 14:11
Last edited by wanzai01 on 2015-06-04 17:18

A small-transaction, high-concurrency benchmark of PostgreSQL vs. Oracle - the PostgreSQL half

Using the same host and storage, I benchmarked the small-transaction throughput of PostgreSQL 9.4.1 and Oracle 12c.
The results are for reference only; interested readers are welcome to rerun the tests or modify the test cases.
(The two tests used different tools, each with its own overhead, so the numbers may not be directly comparable.)
(Even with the same tool, driver performance can differ, so a perfectly unbiased comparison is hard to achieve.)
(The goal here is to push each product to its own limits and surface its own problems and weaknesses, not to stage a product-vs-product shootout; take it as entertainment.)

Benchmark result summary:
PostgreSQL 9.4.1:
UPDATE
Average TPS: 60217
Min TPS: 27666
Max TPS: 65708
SELECT
Average TPS: 328895
Min TPS: 327336
Max TPS: 330360
INSERT
Average TPS: 70433
Min TPS: 57417.4
Max TPS: 75758.9

Oracle 12c:
Details:
http://blog.163.com/digoal@126/b ... 704020154431045764/
UPDATE
Average TPS: 32xxx
Min TPS: 29000
Max TPS: 33900
SELECT
Average TPS: 36xxx
Min TPS: 36300
Max TPS: 36620
INSERT
Average TPS: 9xxx
Min TPS: 8750
Max TPS: 10500

[ Test details ]
Workload:
Selects and updates by primary key.
Inserts into a table with a primary key.

Test environment:
Server: an IBM X3950 bought in 2009; its CPUs are fairly slow by today's standards.
CPU: 4 * 6-core Intel(R) Xeon(R) CPU X7460 @ 2.66GHz
Memory: 32 * 4GB DDR2 533MHz
Disk: Shanghai Baocun (Shannon Systems) 1.2TB Direct-IO PCI-E SSD
Database: PostgreSQL 9.4.1
OS: CentOS 6.6 x64
Filesystem: EXT4, noatime,nodiratime,nobarrier
Update/select data set: 50 million rows
Insert data set: 10 billion rows
PostgreSQL configuration:
listen_addresses = '0.0.0.0'            # what IP address(es) to listen on;
port = 1921                             # (change requires restart)
max_connections = 56                    # (change requires restart)
superuser_reserved_connections = 13     # (change requires restart)
unix_socket_directories = '.'   # comma-separated list of directories
tcp_keepalives_idle = 60                # TCP_KEEPIDLE, in seconds;
tcp_keepalives_interval = 10            # TCP_KEEPINTVL, in seconds;
tcp_keepalives_count = 10               # TCP_KEEPCNT;
shared_buffers = 8GB                 # min 128kB  With plenty of RAM, sizing this close to the database's hot data set gives the best performance.
huge_pages = try                        # on, off, or try
maintenance_work_mem = 1GB              # min 1MB
autovacuum_work_mem = 1GB               # min 1MB, or -1 to use maintenance_work_mem
dynamic_shared_memory_type = posix      # the default is the first option
bgwriter_delay = 10ms                   # 10-10000ms between rounds
synchronous_commit = off                # synchronization level;
wal_sync_method = fdatasync             # the default is the first option
wal_buffers = 16MB                      # min 32kB, -1 sets based on shared_buffers
wal_writer_delay = 10ms         # 1-10000 milliseconds
checkpoint_timeout = 10min            # For databases that generate XLOG very rapidly, set this longer than the time it takes to fill checkpoint_segments, to reduce the performance impact.
                                   # E.g. if producing 512 XLOG segments takes 8 minutes, configure something above 8 minutes.
                                   # The longer this is, the longer crash recovery takes if the database goes down.
checkpoint_completion_target = 0.9
checkpoint_segments = 512               # in logfile segments, min 1, 16MB each  Size this to be >= shared_buffers (512 * 16MB = 8GB here)
random_page_cost = 2.0                  # same scale as above
effective_cache_size = 100GB
log_destination = 'csvlog'              # Valid values are combinations of
logging_collector = on          # Enable capturing of stderr and csvlog
log_truncate_on_rotation = on           # If on, an existing log file with the
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose           # terse, default, or verbose messages
log_timezone = 'PRC'
autovacuum = on                 # Enable autovacuum subprocess?  'on'
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
autovacuum_vacuum_scale_factor = 0.002  # fraction of table size before vacuum; for databases that generate garbage rapidly, the smaller the better
autovacuum_analyze_scale_factor = 0.001 # fraction of table size before analyze
autovacuum_vacuum_cost_delay = 0ms      # default vacuum cost delay; with very capable I/O, do not delay
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'                       # locale for system error message
lc_monetary = 'C'                       # locale for monetary formatting
lc_numeric = 'C'                        # locale for number formatting
lc_time = 'C'                           # locale for time formatting
default_text_search_config = 'pg_catalog.english'
Generate the data set for the select/update tests:
digoal=> create table tbl(id int, info text, crt_time timestamptz default now()) tablespace tbs_digoal;
CREATE TABLE
digoal=> insert into tbl select generate_series(1,50000000),now(),now();
INSERT 0 50000000
digoal=> set maintenance_work_mem='4GB';
SET
digoal=> alter table tbl add constraint tbl_pkey primary key(id) using index tablespace tbs_digoal_idx;
ALTER TABLE
digoal=> \dt+ tbl
                   List of relations
Schema | Name | Type  | Owner  |  Size   | Description
--------+------+-------+--------+---------+-------------
digoal | tbl  | table | digoal | 3634 MB |
(1 row)
digoal=> \di+ tbl_pkey
                         List of relations
Schema |   Name   | Type  | Owner  | Table |  Size   | Description
--------+----------+-------+--------+-------+---------+-------------
digoal | tbl_pkey | index | digoal | tbl   | 1063 MB |
(1 row)
Update test by primary key; it ran for over 8 hours.
$ vi test.sql
\setrandom id 1 50000000
update tbl set crt_time=now() where id=:id;

nohup pgbench -M prepared -n -f test.sql -P 10 -c 26 -j 26 -T 30000000 >./log 2>&1 &
After more than 8 hours of testing, the table had grown by a bit over 100 MB; the index size was unchanged.
digoal=> \dt+
                   List of relations
Schema | Name | Type  | Owner  |  Size   | Description
--------+------+-------+--------+---------+-------------
digoal | tbl  | table | digoal | 3842 MB |
(1 row)
digoal=> \di+
                         List of relations
Schema |   Name   | Type  | Owner  | Table |  Size   | Description
--------+----------+-------+--------+-------+---------+-------------
digoal | tbl_pkey | index | digoal | tbl   | 1063 MB |
(1 row)
The statistics show tbl has been updated more than 2.1 billion times.
digoal=> select * from pg_stat_all_tables where relname='tbl';
-[ RECORD 1 ]-------+------------------------------
relid               | 16387
schemaname          | digoal
relname             | tbl
seq_scan            | 2
seq_tup_read        | 100000000
idx_scan            | 2136267592
idx_tup_fetch       | 2136267592
n_tup_ins           | 100278348
n_tup_upd           | 2136267592
n_tup_del           | 0
n_tup_hot_upd       | 2097129671
n_live_tup          | 50081001
n_dead_tup          | 135956
n_mod_since_analyze | 3111673
last_vacuum         |
last_autovacuum     | 2015-05-02 08:27:02.690159+08
last_analyze        |
last_autoanalyze    | 2015-05-02 08:27:05.800603+08
vacuum_count        | 0
autovacuum_count    | 580
analyze_count       | 0
autoanalyze_count   | 579
The pgbench output can be loaded into the database for analysis (or analyzed with R).
digoal=> create table az(tps numeric);
CREATE TABLE
digoal=# \copy digoal.az from program 'awk ''NR>2 {print $4}'' /home/postgres/log'
COPY 3057
digoal=> select avg(tps),min(tps),max(tps),count(*) from az;
        avg         |   min   |   max   | count
--------------------+---------+---------+-------
60217.684494602552 | 27666.0 | 65708.7 |  3057
(1 row)
Average TPS: 60217
Min TPS: 27666
Max TPS: 65708
Figure: TPS scatter plot (image not included)

Every TPS dip lines up with a checkpoint. The first time a page is dirtied after a checkpoint, a full page image must be written to WAL, which stresses WAL buffer writes. (More precisely, it stresses the ability to write several consecutive WAL blocks: with block_size = 32K and wal_block_size = 8K, one newly dirtied page costs 4 WAL blocks. If WAL fsync can sustain 10,000 8K blocks per second and the writes right after a checkpoint all land on distinct data pages, WAL becomes the bottleneck and TPS can fall below 10000/4 = 2500.) The analysis is in:
http://blog.163.com/digoal@126/b ... 040201542103933969/
http://blog.163.com/digoal@126/b ... 770402015463252387/
http://blog.163.com/digoal@126/b ... 704020154651655783/
http://blog.163.com/digoal@126/b ... 704020154653422892/
http://blog.163.com/digoal@126/b ... 704020154811421484/
http://blog.163.com/digoal@126/b ... 704020154129958753/
With full page writes turned off, the TPS scatter plot looks like this; the checkpoint-induced dips disappear:
(scatter plot drawn with R; image not included)
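For reference, a minimal sketch of how such a run could be configured (my illustration, not necessarily the author's exact steps); note that full_page_writes = off is only crash-safe if the storage guarantees atomic page writes:

postgres=# alter system set full_page_writes = off;
ALTER SYSTEM
postgres=# select pg_reload_conf();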


Select test, running for over 8 hours:
$ vi test.sql
\setrandom id 1 50000000
select * from tbl where id=:id;

nohup pgbench -M prepared -n -f test.sql -P 10 -c 38 -j 38 -T 30000000 >./log 2>&1 &
Load the results:
digoal=> create table az(tps numeric);
CREATE TABLE
digoal=> \c digoal postgres
You are now connected to database "digoal" as user "postgres".
digoal=# \copy digoal.az from program 'awk ''NR>2 {print $4}'' /home/postgres/log'
COPY 3027
digoal=> select avg(tps),min(tps),max(tps),count(*) from digoal.az;
         avg         |   min    |   max    | count
---------------------+----------+----------+-------
328895.445688800793 | 327336.0 | 330360.6 |  3027
(1 row)
Average TPS: 328895
Min TPS: 327336
Max TPS: 330360
Figure: TPS scatter plot (image not included)

Query TPS was very steady, staying above 327,000.

Insert test, running for over 8 hours:
digoal=> drop table tbl;
digoal=> create table tbl(id serial primary key using index tablespace tbs_digoal_idx, info text, crt_time timestamptz default now()) tablespace tbs_digoal;

$ vi test.sql
insert into tbl(info) values ('hello world');

nohup pgbench -M prepared -n -f test.sql -P 10 -c 20 -j 20 -T 30000000 >./log 2>&1 &
Load the results:
After about 4 hours the table had grown to:
digoal=> \dt+
                   List of relations
Schema | Name | Type  | Owner  |  Size  | Description
--------+------+-------+--------+--------+-------------
digoal | tbl  | table | digoal | 69 GB  |
(1 row)

digoal=> \di+
                        List of relations
Schema |   Name   | Type  | Owner  | Table | Size  | Description
--------+----------+-------+--------+-------+-------+-------------
digoal | tbl_pkey | index | digoal | tbl   | 20 GB |
(1 row)

digoal=> create table az(tps numeric);
CREATE TABLE
digoal=> \c digoal postgres
You are now connected to database "digoal" as user "postgres".
digoal=# \copy digoal.az from program 'awk ''NR>2 {print $4}'' /home/postgres/log'
COPY 1385
digoal=# select avg(tps),min(tps),max(tps),count(*) from digoal.az;
        avg         |   min   |   max   | count
--------------------+---------+---------+-------
69839.050685920578 | 65283.7 | 72175.5 |  1385
(1 row)
Average TPS: 70433
Min TPS: 57417.4
Max TPS: 75758.9
(These summary figures presumably cover the full run, while the az sample above covers only the first ~4 hours.)
Figure: TPS scatter plot (image not included)

Checkpoints also affect inserts, though much less than updates, because far fewer full page writes hit the WAL concurrently (an insert fills one page completely before extending into a new one).
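Besides log_checkpoints = on (already enabled above), the dips can be correlated with checkpoints by sampling the stock statistics view between pgbench progress lines; a small sketch:

postgres=# select checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_backend from pg_stat_bgwriter;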

[Other notes]
1. Retesting the update workload on PostgreSQL 9.5: performance is affected by checkpoints in the same way. Settings:
listen_addresses = '0.0.0.0'            # what IP address(es) to listen on;
port = 1922                             # (change requires restart)
max_connections = 100                   # (change requires restart)
superuser_reserved_connections = 13     # (change requires restart)
unix_socket_directories = '.'   # comma-separated list of directories
tcp_keepalives_idle = 60                # TCP_KEEPIDLE, in seconds;
tcp_keepalives_interval = 10            # TCP_KEEPINTVL, in seconds;
tcp_keepalives_count = 10               # TCP_KEEPCNT;
shared_buffers = 8GB                    # min 128kB
huge_pages = try                        # on, off, or try
maintenance_work_mem = 1GB              # min 1MB
autovacuum_work_mem = 1GB               # min 1MB, or -1 to use maintenance_work_mem
dynamic_shared_memory_type = posix      # the default is the first option
vacuum_cost_delay = 0                   # 0-100 milliseconds
vacuum_cost_limit = 10000               # 1-10000 credits
bgwriter_delay = 10ms                   # 10-10000ms between rounds
synchronous_commit = off                # synchronization level;
wal_sync_method = fdatasync             # the default is the first option
wal_buffers = 16MB                      # min 32kB, -1 sets based on shared_buffers
wal_writer_delay = 10ms         # 1-10000 milliseconds
checkpoint_timeout = 10min         # range 30s-1h
max_wal_size = 16GB                         # sized at twice shared_buffers here; works well for DML-heavy databases
min_wal_size = 512MB                        
random_page_cost = 2.0                  # same scale as above
effective_cache_size = 64GB
log_destination = 'csvlog'              # Valid values are combinations of
logging_collector = on          # Enable capturing of stderr and csvlog
log_truncate_on_rotation = on           # If on, an existing log file with the
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose           # terse, default, or verbose messages
log_timezone = 'PRC'
autovacuum = on                 # Enable autovacuum subprocess?  'on'
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
autovacuum_vacuum_scale_factor = 0.002  # fraction of table size before vacuum
autovacuum_analyze_scale_factor = 0.001 # fraction of table size before analyze
autovacuum_vacuum_cost_delay = 0        # default vacuum cost delay for
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'                       # locale for system error message
lc_monetary = 'C'                       # locale for monetary formatting
lc_numeric = 'C'                        # locale for number formatting
lc_time = 'C'                           # locale for time formatting
default_text_search_config = 'pg_catalog.english'
As for why checkpoints have such a large impact, a later article will dig into the checkpoint source code for the explanation.


[Summary]
1. The results expose a weakness of PostgreSQL around checkpoints: the impact is too large. (It is actually not so much the checkpointer's own I/O; in this update workload the burst of FULL PAGE WRITEs right after a checkpoint inflates WAL write-buffer latency, and that drags TPS down.)
Interested readers can find the analysis in these articles of mine:
http://blog.163.com/digoal@126/b ... 040201542103933969/
http://blog.163.com/digoal@126/b ... 770402015463252387/
http://blog.163.com/digoal@126/b ... 704020154651655783/
http://blog.163.com/digoal@126/b ... 704020154653422892/
http://blog.163.com/digoal@126/b ... 704020154811421484/
http://blog.163.com/digoal@126/b ... 704020154129958753/
2. If you'd rather not use pgbench to test PG, Python works too, but psycopg2 does not support bind variables (server-side prepared statements), so throughput is considerably lower.
The reason is explained in:
http://blog.163.com/digoal@126/b ... 770402015151653642/
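As an aside, server-side prepared statements can still be exercised by hand from any driver; this is roughly what pgbench -M prepared does underneath (a sketch, with a hypothetical statement name):

digoal=> prepare upd (int) as update tbl set crt_time=now() where id=$1;
PREPARE
digoal=> execute upd(1);
UPDATE 1

The Python test script: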
import threading
import time
import psycopg2
import random

xs=12000
tps=dict()

class n_t(threading.Thread):   # worker thread class, derived from threading.Thread
  def __init__(self, num):
    threading.Thread.__init__(self)
    self.thread_num = num

  def run(self): #Overwrite run() method, put what you want the thread do here
    conn = psycopg2.connect(database="digoal", user="digoal", password="digoal", host="/data01/pgdata/pg_root", port="1922")
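    # Note: a "host" beginning with "/" is treated by libpq as a Unix-domain
    # socket directory, so this connects over the local socket.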
    curs = conn.cursor()
    conn.autocommit=True

    tps[self.thread_num] = dict()

    f = open("/tmp/pg_test." + str(self.thread_num), "w")

    for x in range(1,3001):
      start_t = time.time()
      for i in range(0,xs):
        curs.execute("update tbl set info=now(),crt_time=now() where id=%(id)s", {"id": random.randrange(1,50000000)})      
      stop_t = time.time()
      tps[self.thread_num][x] = round(xs/(stop_t-start_t),2)
      res = "Round: " + str(x) + " TID: " + str(self.thread_num) + " Sec: " + str(round((stop_t-start_t),2)) + " tps: " + str(tps[self.thread_num][x])
      print >> f, res
      f.flush()

    f.close()

def test():
  t_names = []
  for i in xrange(0,27):
    t_names.append(n_t(i))

  for t in t_names:
    t.start()
  
  return

if __name__ == '__main__':
  test()
3. http://www.slideshare.net/petere ... thon-and-postgresql

12#
Posted on 2015-06-04 14:20
How to prevent PostgreSQL database bloat
People often report that a table has bloated yet again, so what are the common causes of object bloat?

1. autovacuum not enabled
If autovacuum is off and no sensible custom vacuum schedule is in place, dead tuples are never reclaimed in time while new data keeps pouring in, so bloat is inevitable. ("New data" covers inserts and updates; an update creates a new row version.)

2. autovacuum is enabled, but for one reason or another reclaims too slowly while new garbage keeps being produced, so the table bloats anyway.
Why reclaim may lag:
2.1. Poor I/O
On a very busy database, weak I/O slows garbage collection, and bloat follows.
This usually shows up when the database holds some huge tables undergoing a whole-table vacuum (to prevent xid wraparound; a table whose age exceeds vacuum_freeze_table_age gets a full-table scan). The heavy I/O that generates makes it easy for the table itself or other tables to bloat in the meantime.

2.2. autovacuum triggers too late
When does autovacuum trigger? From the source comments:
* A table needs to be vacuumed if the number of dead tuples exceeds a
* threshold.  This threshold is calculated as
*
* threshold = vac_base_thresh + vac_scale_factor * reltuples
If no per-table autovacuum thresh/factor is set, the defaults from the configuration file apply:
int                     autovacuum_vac_thresh;  // default 50
double          autovacuum_vac_scale;  // default 0.2
In other words, autovacuum only fires once dead tuples reach roughly 20% of the table.
Reclaiming then takes time of its own, so the table's peak bloat ends up above 20%.
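A worked example of the formula (my numbers, for illustration): with the defaults, a 10-million-row table is only vacuumed after 50 + 0.2 * 10,000,000 = 2,000,050 dead tuples have piled up. Both knobs can also be overridden per table, e.g.:

postgres=# alter table tbl set (autovacuum_vacuum_scale_factor = 0.002, autovacuum_vacuum_threshold = 1000);
ALTER TABLE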

2.3. All workers are busy: a table's garbage can exceed the threshold while no worker is free to vacuum it, so it may bloat in the meantime.
The number of forkable worker processes is set by autovacuum_max_workers; the maximum is fixed when autovacuum's shared memory is initialized. See the code:
src/backend/postmaster/autovacuum.c
/*
* AutoVacuumShmemInit
*              Allocate and initialize autovacuum-related shared memory
*/
void
AutoVacuumShmemInit(void)
{
        bool            found;

        AutoVacuumShmem = (AutoVacuumShmemStruct *)
                ShmemInitStruct("AutoVacuum Data",
                                                AutoVacuumShmemSize(),
                                                &found);

        if (!IsUnderPostmaster)
        {
                WorkerInfo      worker;
                int                     i;

                Assert(!found);

                AutoVacuumShmem->av_launcherpid = 0;
                dlist_init(&AutoVacuumShmem->av_freeWorkers);
                dlist_init(&AutoVacuumShmem->av_runningWorkers);
                AutoVacuumShmem->av_startingWorker = NULL;

                worker = (WorkerInfo) ((char *) AutoVacuumShmem +
                                                           MAXALIGN(sizeof(AutoVacuumShmemStruct)));

                /* initialize the WorkerInfo free list */
                for (i = 0; i < autovacuum_max_workers; i++)
                        dlist_push_head(&AutoVacuumShmem->av_freeWorkers,
                                                        &worker[i].wi_links);
        }
        else
                Assert(found);
}
If the database has many tables, all fairly large, and the number needing vacuum exceeds autovacuum_max_workers, some tables must wait for a free worker; that is when they tend to bloat.
In older PostgreSQL versions only one worker ran per database at a time; current versions no longer have that limit:
src/backend/postmaster/autovacuum.c
* Note that there can be more than one worker in a database concurrently.
* They will store the table they are currently vacuuming in shared memory, so
* that other workers avoid being blocked waiting for the vacuum lock for that
* table.  They will also reload the pgstats data just before vacuuming each
* table, to avoid vacuuming a table that was just finished being vacuumed by
* another worker and thus is no longer noted in shared memory.  However,
* there is a window (caused by pgstat delay) on which a worker may choose a
* table that was already vacuumed; this is a bug in the current design.
So if your PostgreSQL cluster hosts many databases, you may need more worker processes.
Another point to note: while working, each worker can consume at most the memory given by these parameters:
#maintenance_work_mem = 64MB            # min 1MB
#autovacuum_work_mem = -1               # min 1MB, or -1 to use maintenance_work_mem
so the more workers there are, the more memory is needed.
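A rough ceiling estimate: autovacuum_max_workers * autovacuum_work_mem; with the 10 workers and 1GB used in the test section below, that is up to ~10GB on top of shared_buffers. A sketch for checking the relevant settings:

postgres=# select name, setting, unit from pg_settings where name in ('autovacuum_max_workers', 'autovacuum_work_mem', 'maintenance_work_mem');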

2.4. A long transaction holding a transaction Exclusive lock
PostgreSQL currently has a rather serious weakness here: while such a long transaction is open, any garbage produced during it cannot be reclaimed, and the database bloats.
The reason is in:
src/backend/utils/time/tqual.c
/*
* HeapTupleSatisfiesVacuum
*
*      Determine the status of tuples for VACUUM purposes.  Here, what
*      we mainly want to know is if a tuple is potentially visible to *any*
*      running transaction.  If so, it can't be removed yet by VACUUM.
*
* OldestXmin is a cutoff XID (obtained from GetOldestXmin()).  Tuples
* deleted by XIDs >= OldestXmin are deemed "recently dead"; they might
* still be visible to some open transaction, so we can't remove them,
* even if we see that the deleting transaction has committed.
*/
HTSV_Result
HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
                                                 Buffer buffer)
{
A test below demonstrates this.

2.5. autovacuum_vacuum_cost_delay is enabled
With autovacuum_vacuum_cost_delay enabled, garbage collection becomes cost-based. That helps cap VACUUM's I/O impact, but on a system whose I/O is not a problem it is unnecessary and only stretches out garbage collection.
Once an autovacuum process has accumulated autovacuum_vacuum_cost_limit worth of cost, it sleeps for autovacuum_vacuum_cost_delay before continuing:
        /*
         * Adjust cost limit of each active worker to balance the total of cost
         * limit to autovacuum_vacuum_cost_limit.
         */
        cost_avail = (double) vac_cost_limit / vac_cost_delay;
        dlist_foreach(iter, &AutoVacuumShmem->av_runningWorkers)
        {
                WorkerInfo      worker = dlist_container(WorkerInfoData, wi_links, iter.cur);

                if (worker->wi_proc != NULL &&
                        worker->wi_dobalance &&
                        worker->wi_cost_limit_base > 0 && worker->wi_cost_delay > 0)
                {
                        int                     limit = (int)
                        (cost_avail * worker->wi_cost_limit_base / cost_total);

                        /*
                         * We put a lower bound of 1 on the cost_limit, to avoid division-
                         * by-zero in the vacuum code.  Also, in case of roundoff trouble
                         * in these calculations, let's be sure we don't ever set
                         * cost_limit to more than the base value.
                         */
                        worker->wi_cost_limit = Max(Min(limit,
                                                                                        worker->wi_cost_limit_base),
                                                                                1);
                }
The limit is computed from a few more parameters,
covering the cost of blocks hit in SHARED BUFFER, blocks that miss, and the extra cost of dirtying a previously clean block:
vacuum_cost_page_hit (integer)
The estimated cost for vacuuming a buffer found in the shared buffer cache. It represents the cost to lock the buffer pool, lookup the shared hash table and scan the content of the page. The default value is one.

vacuum_cost_page_miss (integer)
The estimated cost for vacuuming a buffer that has to be read from disk. This represents the effort to lock the buffer pool, lookup the shared hash table, read the desired block in from the disk and scan its content. The default value is 10.

vacuum_cost_page_dirty (integer)
The estimated cost charged when vacuum modifies a block that was previously clean. It represents the extra I/O required to flush the dirty block out to disk again. The default value is 20.
For systems whose I/O is not a problem, setting autovacuum_vacuum_cost_limit is not recommended.
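A minimal sketch of switching the throttle off on such a system (a delay of 0 disables cost-based sleeping for autovacuum):

postgres=# alter system set autovacuum_vacuum_cost_delay = 0;
ALTER SYSTEM
postgres=# select pg_reload_conf();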

2.6. The autovacuum launcher's wake-up interval is too long
The interval is set by autovacuum_naptime. The launcher tells the postmaster to fork worker processes for garbage collection; if the launcher is forever napping while garbage piles up, what else would you expect but bloat?
There is also a floor hard-coded in the source: the nap cannot go below MIN_AUTOVAC_SLEEPTIME, i.e. 100 milliseconds:
src/backend/postmaster/autovacuum.c
/* the minimum allowed time between two awakenings of the launcher */
#define MIN_AUTOVAC_SLEEPTIME 100.0               /* milliseconds */
......
        /* The smallest time we'll allow the launcher to sleep. */
        if (nap->tv_sec <= 0 && nap->tv_usec <= MIN_AUTOVAC_SLEEPTIME * 1000)
        {
                nap->tv_sec = 0;
                nap->tv_usec = MIN_AUTOVAC_SLEEPTIME * 1000;
        }
......
                /*
                 * Wait until naptime expires or we get some type of signal (all the
                 * signal handlers will wake us by calling SetLatch).
                 */
                rc = WaitLatch(&MyProc->procLatch,
                                           WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                           (nap.tv_sec * 1000L) + (nap.tv_usec / 1000L));
I will demonstrate this with a test below.

2.7. Bulk deletes or updates
For example, deleting or updating 9GB of a 10GB table in a single SQL statement or transaction: those 9GB can only be reclaimed after the transaction ends, which quietly raises the odds of bloat.

2.8. Lots of non-HOT updates bloat indexes. A BTREE index page can only be recycled once nothing references it at all, so indexes bloat rather easily.
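Whichever of these causes applies, dead-tuple buildup can be watched through the stock statistics views; a small sketch to surface the worst tables:

postgres=# select relname, n_live_tup, n_dead_tup, last_autovacuum from pg_stat_all_tables order by n_dead_tup desc limit 5;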

[Test]
The tests below use these settings:
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 10
autovacuum_naptime = 1
autovacuum_vacuum_threshold = 5
autovacuum_analyze_threshold = 5
autovacuum_vacuum_scale_factor = 0.002
autovacuum_analyze_scale_factor = 0.001
autovacuum_vacuum_cost_delay = 0
Test data:
postgres=# create table tbl (id int primary key, info text, crt_time timestamp);
CREATE TABLE
postgres=# insert into tbl select generate_series(1,2000000),md5(random()::text),clock_timestamp();
INSERT 0 2000000
postgres=# \dt+ tbl
                    List of relations
Schema | Name | Type  |  Owner   |  Size  | Description
--------+------+-------+----------+--------+-------------
public | tbl  | table | postgres | 146 MB |
(1 row)

postgres=# \di+ tbl_pkey
                         List of relations
Schema |   Name   | Type  |  Owner   | Table | Size  | Description
--------+----------+-------+----------+-------+-------+-------------
public | tbl_pkey | index | postgres | tbl   | 43 MB |
(1 row)
Test script:
each statement updates up to ~500,000 rows (the range :id-250000 .. :id+250000)
$ vi test.sql
\setrandom id 1 2000000
update tbl set info=md5(random()::text) where id between :id-250000 and :id+250000;
Two things to test:
1. A long transaction holding a transaction Exclusive lock: garbage produced while it is open cannot be reclaimed.
postgres@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -c 1 -j 1 -T 500000
Watch the log:
postgres@db-172-16-3-150-> tail -f -n 1 postgresql-2015-04-29_174535.csv|grep removable
tuples: 500001 removed, 1710872 remain, 0 are dead but not yet removable
tuples: 0 removed, 478 remain, 3 are dead but not yet removable
tuples: 499647 removed, 1844149 remain, 0 are dead but not yet removable
tuples: 500001 removed, 1830118 remain, 0 are dead but not yet removable
tuples: 290450 removed, 1865527 remain, 0 are dead but not yet removable
So far so good. Next, open a transaction that holds a transaction Exclusive lock:
postgres=# begin;
BEGIN
postgres=# select txid_current();
txid_current
--------------
    314030959
(1 row)
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
           6073
(1 row)
postgres=# select * from pg_locks where pid=6073;
-[ RECORD 1 ]------+--------------
locktype           | virtualxid
database           |
relation           |
page               |
tuple              |
virtualxid         | 4/180
transactionid      |
classid            |
objid              |
objsubid           |
virtualtransaction | 4/180
pid                | 6073
mode               | ExclusiveLock
granted            | t
fastpath           | t
-[ RECORD 2 ]------+--------------
locktype           | transactionid
database           |
relation           |
page               |
tuple              |
virtualxid         |
transactionid      | 314030959
classid            |
objid              |
objsubid           |
virtualtransaction | 4/180
pid                | 6073
mode               | ExclusiveLock
granted            | t
fastpath           | f
From another session, txid_current_snapshot() shows it as a still-open transaction:
postgres=# select * from txid_current_snapshot();
-[ RECORD 1 ]---------+------------------------------
txid_current_snapshot | 314030959:314030981:314030959
Back to the log:
the not-yet-removable row count keeps climbing.
tuples: 0 removed, 2391797 remain, 500001 are dead but not yet removable
tuples: 0 removed, 484 remain, 9 are dead but not yet removable
tuples: 0 removed, 2459288 remain, 500001 are dead but not yet removable
tuples: 0 removed, 484 remain, 9 are dead but not yet removable
tuples: 0 removed, 2713489 remain, 760235 are dead but not yet removable
tuples: 0 removed, 487 remain, 12 are dead but not yet removable
tuples: 0 removed, 2572991 remain, 760235 are dead but not yet removable
tuples: 0 removed, 487 remain, 12 are dead but not yet removable
tuples: 0 removed, 2849378 remain, 760235 are dead but not yet removable
tuples: 0 removed, 487 remain, 12 are dead but not yet removable
tuples: 0 removed, 3023757 remain, 760235 are dead but not yet removable
tuples: 0 removed, 487 remain, 12 are dead but not yet removable
tuples: 0 removed, 3135900 remain, 1137469 are dead but not yet removable
tuples: 0 removed, 490 remain, 15 are dead but not yet removable
The table and index have visibly bloated:
postgres=# \dt+ tbl
                    List of relations
Schema | Name | Type  |  Owner   |  Size  | Description
--------+------+-------+----------+--------+-------------
public | tbl  | table | postgres | 781 MB |
(1 row)

postgres=# \di+ tbl_pkey
                          List of relations
Schema |   Name   | Type  |  Owner   | Table |  Size  | Description
--------+----------+-------+----------+-------+--------+-------------
public | tbl_pkey | index | postgres | tbl   | 308 MB |
(1 row)
After this transaction is closed, the garbage that accumulated during it can be reclaimed normally.
Before closing it, open one more transaction:
postgres=# begin;
BEGIN
postgres=# select txid_current();
txid_current
--------------
    314031042
(1 row)
Then end the earlier one:
postgres=# end;
COMMIT
The point here: consecutive, overlapping transactions like this do not keep the previous transaction's leftover garbage unreclaimable.
tuples: 0 removed, 481 remain, 6 are dead but not yet removable
tuples: 13629196 removed, 2515757 remain, 500001 are dead but not yet removable
tuples: 0 removed, 4845701 remain, 1000002 are dead but not yet removable
tuples: 0 removed, 7146782 remain, 1500003 are dead but not yet removable
End the later one too, and the garbage is fully reclaimed:
postgres=# end;
COMMIT
tuples: 7183691 removed, 11252550 remain, 0 are dead but not yet removable
tuples: 500001 removed, 6234858 remain, 0 are dead but not yet removable
But the table and index have already bloated, and they cannot shrink back without a table rewrite (vacuum full, cluster) or a tool such as pg_reorg or pg_repack.
Another realistic example: sustained concurrent batch updates can also cause bloat.
Say a table holds 1 million rows, split across 10 processes, each batch-updating its own 100,000 rows, over and over.
Why does this bloat? A vacuum worker operates at table granularity: only one process vacuums a given table at a time. The scenario creates three problems:
1. the instantaneous rate of garbage creation can exceed the reclaim rate;
2. the space demanded for new tuple versions can exceed what the FSM has free;
3. during reclaim, the concurrent updates started by the other processes (each holding a transaction exclusive lock) recreate the not-yet-removable situation from the previous example.
All three make it more likely that new blocks must be extended, i.e. bloat.
Test example:
postgres=# truncate tbl;
TRUNCATE TABLE
postgres=# insert into tbl select generate_series(1,1000000),md5(random()::text),clock_timestamp();
INSERT 0 1000000
postgres=# \dt+ tbl
List of relations
-[ RECORD 1 ]---------
Schema      | public
Name        | tbl
Type        | table
Owner       | postgres
Size        | 73 MB
Description |

postgres=# \di+ tbl_pkey
List of relations
-[ RECORD 1 ]---------
Schema      | public
Name        | tbl_pkey
Type        | index
Owner       | postgres
Table       | tbl
Size        | 21 MB
Description |

$ vi t1.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=1 and id<100000;
$ vi t2.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=100001 and id<200000;
$ vi t3.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=200001 and id<300000;
$ vi t4.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=300001 and id<400000;
$ vi t5.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=400001 and id<500000;
$ vi t6.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=500001 and id<600000;
$ vi t7.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=600001 and id<700000;
$ vi t8.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=700001 and id<800000;
$ vi t9.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=800001 and id<900000;
$ vi t10.sql
update tbl set info=info,crt_time=clock_timestamp() where id >=900001 and id<=1000000;

pgbench -M prepared -n -r -f ./t1.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t2.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t3.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t4.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t5.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t6.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t7.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t8.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t9.sql -c 1 -j 1 -T 500000 &
pgbench -M prepared -n -r -f ./t10.sql -c 1 -j 1 -T 500000 &

Non-removable garbage shows up, as expected:
tuples: 0 removed, 2049809 remain, 999991 are dead but not yet removable
tuples: 501373 removed, 2176172 remain, 999991 are dead but not yet removable
tuples: 1603158 removed, 2517367 remain, 899562 are dead but not yet removable
tuples: 405093 removed, 2647780 remain, 899992 are dead but not yet removable
tuples: 1100546 removed, 2724724 remain, 899562 are dead but not yet removable
tuples: 528200 removed, 2864735 remain, 1141307 are dead but not yet removable
tuples: 981628 removed, 2757909 remain, 933307 are dead but not yet removable

Bloat:
postgres=# \dt+ tbl
List of relations
-[ RECORD 1 ]---------
Schema      | public
Name        | tbl
Type        | table
Owner       | postgres
Size        | 554 MB
Description |

postgres=# \di+ tbl_pkey
List of relations
-[ RECORD 1 ]---------
Schema      | public
Name        | tbl_pkey
Type        | index
Owner       | postgres
Table       | tbl
Size        | 114 MB
Description |
If the demand for new tuple versions keeps outrunning the FSM's free space, the bloat keeps growing.
The improvement is simple: shrink the batch size, i.e. shorten each transaction. That shortens the time the transaction exclusive lock is held, reduces the not-yet-removable cases, and lowers each transaction's free-space demand, so no new blocks need to be extended and the table does not bloat.
Example:
postgres=# create sequence seq cache 10;
CREATE SEQUENCE
postgres=# vacuum full tbl;
VACUUM

postgres@db-172-16-3-150-> vi test.sql
update tbl set info=info,crt_time=clock_timestamp() where id=mod(nextval('seq'),2000001);

To speed this up I marked nextval immutable so the SQL above can use an index scan (do not imitate this in production):
postgres=# alter function nextval(regclass) immutable;

postgres@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -c 20 -j 10 -T 500000

Because each transaction is short, only small not-yet-removable counts appear,
and each update's new row version needs so little space that it fits within the FSM's free space, so no blocks need to be extended (no EXTEND BLOCK).
tuples: 11116 removed, 977673 remain, 591 are dead but not yet removable
tuples: 12050 removed, 978437 remain, 979 are dead but not yet removable
tuples: 12687 removed, 981375 remain, 227 are dead but not yet removable
tuples: 12911 removed, 978912 remain, 831 are dead but not yet removable
tuples: 7 removed, 475 remain, 0 are dead but not yet removable
tuples: 13133 removed, 979761 remain, 522 are dead but not yet removable
tuples: 14419 removed, 977651 remain, 1077 are dead but not yet removable
tuples: 12111 removed, 978558 remain, 700 are dead but not yet removable

Half an hour later, still no bloat:
postgres=# \dt+ tbl
List of relations
-[ RECORD 1 ]---------
Schema      | public
Name        | tbl
Type        | table
Owner       | postgres
Size        | 75 MB
Description |

postgres=# \di+ tbl_pkey
List of relations
-[ RECORD 1 ]---------
Schema      | public
Name        | tbl_pkey
Type        | index
Owner       | postgres
Table       | tbl
Size        | 21 MB
Description |

2. An autovacuum launcher that sleeps too long hurts garbage collection.
Set autovacuum_naptime = 1000, a 1000-second nap, and watch how things fall apart.
pg_ctl reload
postgres=# vacuum full tbl;
VACUUM
postgres=# \dt+ tbl
List of relations
-[ RECORD 1 ]---------
Schema      | public
Name        | tbl
Type        | table
Owner       | postgres
Size        | 145 MB
Description |

postgres=# \di+ tbl_pkey
List of relations
-[ RECORD 1 ]---------
Schema      | public
Name        | tbl_pkey
Type        | index
Owner       | postgres
Table       | tbl
Size        | 43 MB
Description |

postgres@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -c 1 -j 1 -T 500000
Ignore the log this time; after a while, check pg_stat_all_tables:
postgres@db-172-16-3-150-> psql
psql (9.4.1)
Type "help" for help.
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_all_tables where relname='tbl';
-[ RECORD 1 ]-------+------------------------------
relid               | 60511
schemaname          | public
relname             | tbl
seq_scan            | 6
seq_tup_read        | 12082648
idx_scan            | 116
idx_tup_fetch       | 49232974
n_tup_ins           | 2000000
n_tup_upd           | 50732969
n_tup_del           | 0
n_tup_hot_upd       | 81869
n_live_tup          | 2328301
n_dead_tup          | 7895450
n_mod_since_analyze | 7825343
last_vacuum         |
last_autovacuum     | 2015-04-29 18:02:45.325102+08
last_analyze        |
last_autoanalyze    | 2015-04-29 18:02:45.596096+08
vacuum_count        | 0
autovacuum_count    | 247
analyze_count       | 0
autoanalyze_count   | 91
The table has bloated:
postgres=# \dt+ tbl
List of relations
-[ RECORD 1 ]---------
Schema      | public
Name        | tbl
Type        | table
Owner       | postgres
Size        | 393 MB
Description |
postgres=# \di+ tbl_pkey
List of relations
-[ RECORD 1 ]---------
Schema      | public
Name        | tbl_pkey
Type        | index
Owner       | postgres
Table       | tbl
Size        | 115 MB
Description |

[Summary]
Given the analysis above, how do we reduce or avoid PostgreSQL data bloat?
1. Always enable autovacuum.
2. Improve the system's I/O capability; the more the better.
3. Tune the trigger thresholds to match the table sizes: lower autovacuum_vacuum_scale_factor and autovacuum_analyze_scale_factor. Say I want vacuum to trigger once 10,000 garbage rows exist; for a 10-million-row table, autovacuum_vacuum_scale_factor should then be one thousandth, i.e. 0.001, and autovacuum_analyze_scale_factor about 0.0005.
4. Increase autovacuum_max_workers and autovacuum_work_mem, and add system memory accordingly.
For example, for a cluster with many frequently updated tables, set autovacuum_max_workers to the CPU core count and autovacuum_work_mem to 2GB, and make sure the memory reserved for the system exceeds autovacuum_max_workers * autovacuum_work_mem.
5. When designing applications, avoid long transactions that hold transaction Exclusive locks (both DDL and DML take them).
6. On systems whose I/O is not a problem, disable autovacuum_vacuum_cost_delay.
7. Set autovacuum_naptime as low as it goes; if wake-ups are still too slow, the floor in the source can be patched, e.g. to 1 millisecond:
#define MIN_AUTOVAC_SLEEPTIME 1.0               /* milliseconds */
8. When designing applications, avoid huge batch updates and deletes; split them into several transactions.
9. Use a large block size; for modern hardware 32KB is a good choice. fillfactor hardly needs attention, 100 is fine; lowering it buys little, because the database always carries some garbage, meaning no page actually stays full after updates anyway.
10. If a table really has bloated, it can be reclaimed with a table rewrite (vacuum full, cluster), at the cost of holding an exclusive lock. Better, use pg_reorg or pg_repack, which swap relfilenodes to shorten the time the exclusive lock must be held. (A minimal sketch of the rewrite route follows.)
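A minimal sketch of the rewrite options from point 10 (both statements hold an ACCESS EXCLUSIVE lock on the table for the whole rewrite, which pg_repack largely avoids):

postgres=# vacuum full tbl;              -- rewrite table and indexes compactly
postgres=# cluster tbl using tbl_pkey;   -- rewrite, physically ordered by the index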

[References]
1. src/backend/postmaster/autovacuum.c

13#
Posted on 2015-06-04 14:21
Why can a long transaction in the database cause garbage bloat?
What exactly is the reason?
A look at PostgreSQL's garbage-collection code explains it.
One of the reclaim functions:
src/backend/commands/vacuumlazy.c
/*
*      lazy_scan_heap() -- scan an open heap relation
*
*              This routine prunes each page in the heap, which will among other
*              things truncate dead tuples to dead line pointers, defragment the
*              page, and set commit status bits (see heap_page_prune).  It also builds
*              lists of dead tuples and pages with free space, calculates statistics
*              on the number of live tuples in the heap, and marks pages as
*              all-visible if appropriate.  When done, or when we run low on space for
*              dead-tuple TIDs, invoke vacuuming of indexes and call lazy_vacuum_heap
*              to reclaim dead line pointers.
*
*              If there are no indexes then we can reclaim line pointers on the fly;
*              dead line pointers need only be retained until all index pointers that
*              reference them have been killed.
*/
static void
lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
                           Relation *Irel, int nindexes, bool scan_all)
{
...
        for (blkno = 0; blkno < nblocks; blkno++)
        {
                Buffer          buf;
                Page            page;
                OffsetNumber offnum,
                                        maxoff;
                bool            tupgone,  // whether this tuple can be reclaimed
......
                /*
                 * Note: If you change anything in the loop below, also look at
                 * heap_page_is_all_visible to see if that needs to be changed.
                 */
                for (offnum = FirstOffsetNumber;
                         offnum <= maxoff;
                         offnum = OffsetNumberNext(offnum))
                {
                        ItemId          itemid;
......
                        tupgone = false;

                        switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))  // determine the tuple's status: does it qualify for VACUUM?
                        {
                                case HEAPTUPLE_DEAD:
......
                                        if (HeapTupleIsHotUpdated(&tuple) ||
                                                HeapTupleIsHeapOnly(&tuple))
                                                nkeep += 1;
                                        else
                                                tupgone = true; /* we can delete the tuple */
                                        all_visible = false;
                                        break;
......
                                case HEAPTUPLE_RECENTLY_DEAD:   // garbage produced after the oldest active transaction started; it cannot be reclaimed

                                        /*
                                         * If tuple is recently deleted then we must not remove it
                                         * from relation.
                                         */
                                        nkeep += 1;
                                        all_visible = false;
                                        break;

Let's look at HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf), the function that decides whether a tuple qualifies for VACUUM.
Where does OldestXmin come from?
src/backend/commands/vacuum.c
/*
* vacuum_set_xid_limits() -- compute oldest-Xmin and freeze cutoff points
*
* The output parameters are:
* - oldestXmin is the cutoff value used to distinguish whether tuples are
*       DEAD or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
* - freezeLimit is the Xid below which all Xids are replaced by
*       FrozenTransactionId during vacuum.
* - xidFullScanLimit (computed from table_freeze_age parameter)
*       represents a minimum Xid value; a table whose relfrozenxid is older than
*       this will have a full-table vacuum applied to it, to freeze tuples across
*       the whole table.  Vacuuming a table younger than this value can use a
*       partial scan.
* - multiXactCutoff is the value below which all MultiXactIds are removed from
*       Xmax.
* - mxactFullScanLimit is a value against which a table's relminmxid value is
*       compared to produce a full-table vacuum, as with xidFullScanLimit.
*
* xidFullScanLimit and mxactFullScanLimit can be passed as NULL if caller is
* not interested.
*/
void
vacuum_set_xid_limits(Relation rel,
                                          int freeze_min_age,
                                          int freeze_table_age,
                                          int multixact_freeze_min_age,
                                          int multixact_freeze_table_age,
                                          TransactionId *oldestXmin,
                                          TransactionId *freezeLimit,
                                          TransactionId *xidFullScanLimit,
                                          MultiXactId *multiXactCutoff,
                                          MultiXactId *mxactFullScanLimit)
{
......
        /*
         * We can always ignore processes running lazy vacuum.  This is because we
         * use these values only for deciding which tuples we must keep in the
         * tables.  Since lazy vacuum doesn't write its XID anywhere, it's safe to
         * ignore it.  In theory it could be problematic to ignore lazy vacuums in
         * a full vacuum, but keep in mind that only one vacuum process can be
         * working on a particular table at any time, and that each vacuum is
         * always an independent transaction.
         */
        *oldestXmin = GetOldestXmin(rel, true);  // take the oldest still-running (uncommitted) transaction ID in the system

src/backend/storage/ipc/procarray.c
/*
* GetOldestXmin -- returns oldest transaction that was running
*                                      when any current transaction was started.
*
* If rel is NULL or a shared relation, all backends are considered, otherwise
* only backends running in this database are considered.
*
* If ignoreVacuum is TRUE then backends with the PROC_IN_VACUUM flag set are
* ignored.
*
* This is used by VACUUM to decide which deleted tuples must be preserved in
* the passed in table. For shared relations backends in all databases must be
* considered, but for non-shared relations that's not required, since only
* backends in my own database could ever see the tuples in them. Also, we can
* ignore concurrently running lazy VACUUMs because (a) they must be working
* on other tables, and (b) they don't need to do snapshot-based lookups.
*
* This is also used to determine where to truncate pg_subtrans.  For that
* backends in all databases have to be considered, so rel = NULL has to be
* passed in.
*
* Note: we include all currently running xids in the set of considered xids.
* This ensures that if a just-started xact has not yet set its snapshot,
* when it does set the snapshot it cannot set xmin less than what we compute.
* See notes in src/backend/access/transam/README.
*
* Note: despite the above, it's possible for the calculated value to move
* backwards on repeated calls. The calculated value is conservative, so that
* anything older is definitely not considered as running by anyone anymore,
* but the exact value calculated depends on a number of things. For example,
* if rel = NULL and there are no transactions running in the current
* database, GetOldestXmin() returns latestCompletedXid. If a transaction
  // [author's note] latestCompletedXid is the cluster's newest completed transaction ID;
  //   transactions after it are not reclaimed even if they later become garbage. vacuum does
  //   not consider whether other open transactions actually need to see that garbage
  //   (sessions at repeatable read isolation or above might still read it).
* begins after that, its xmin will include in-progress transactions in other
* databases that started earlier, so another call will return a lower value.
* Nonetheless it is safe to vacuum a table in the current database with the
* first result.  There are also replication-related effects: a walsender
* process can set its xmin based on transactions that are no longer running
* in the master but are still being replayed on the standby, thus possibly
* making the GetOldestXmin reading go backwards.  In this case there is a
* possibility that we lose data that the standby would like to have, but
* there is little we can do about that --- data is only protected if the
* walsender runs continuously while queries are executed on the standby.
* (The Hot Standby code deals with such cases by failing standby queries
* that needed to access already-removed data, so there's no integrity bug.)
* The return value is also adjusted with vacuum_defer_cleanup_age, so
* increasing that setting on the fly is another easy way to make
* GetOldestXmin() move backwards, with no consequences for data integrity.
*/
TransactionId
GetOldestXmin(Relation rel, bool ignoreVacuum)
{
        ProcArrayStruct *arrayP = procArray;
        TransactionId result;  // the value to be returned
......
        /*
         * We initialize the MIN() calculation with latestCompletedXid + 1. This
         * is a lower bound for the XIDs that might appear in the ProcArray later,
         * and so protects us against overestimating the result due to future
         * additions.
         */
        result = ShmemVariableCache->latestCompletedXid;
        Assert(TransactionIdIsNormal(result));
        TransactionIdAdvance(result);

src/include/access/transam.h
/* in transam/varsup.c */
extern PGDLLIMPORT VariableCache ShmemVariableCache;

/*
* VariableCache is a data structure in shared memory that is used to track
* OID and XID assignment state.  For largely historical reasons, there is
* just one struct with different fields that are protected by different
* LWLocks.
*
* Note: xidWrapLimit and oldestXidDB are not "active" values, but are
* used just to generate useful messages when xidWarnLimit or xidStopLimit
* are exceeded.
*/
typedef struct VariableCacheData
{
        /*
         * These fields are protected by OidGenLock.
         */
        Oid                     nextOid;                /* next OID to assign */
        uint32          oidCount;               /* OIDs available before must do XLOG work */

        /*
         * These fields are protected by XidGenLock.
         */
        TransactionId nextXid;          /* next XID to assign */

        TransactionId oldestXid;        /* cluster-wide minimum datfrozenxid */
        TransactionId xidVacLimit;      /* start forcing autovacuums here */
        TransactionId xidWarnLimit; /* start complaining here */
        TransactionId xidStopLimit; /* refuse to advance nextXid beyond here */
        TransactionId xidWrapLimit; /* where the world ends */
        Oid                     oldestXidDB;    /* database with minimum datfrozenxid */

        /*
         * These fields are protected by ProcArrayLock.
         */
        TransactionId latestCompletedXid;       /* newest XID that has committed or
                                                                                 * aborted -- i.e. the cluster's newest completed transaction ID */
} VariableCacheData;

typedef VariableCacheData *VariableCache;
As you can see, when reclaiming garbage PostgreSQL only checks whether a tuple died after the cutoff taken when vacuum started; if so, the tuple is not reclaimed.
The effect resembles what this function reports:
txid_current_snapshot()
Snapshot Components
Name        Description
xmin        Earliest transaction ID (txid) that is still active. All earlier transactions will either be committed and visible, or rolled back and dead.
// when the vacuum process starts, it takes this XMIN from its snapshot; dead tuples newer than this XMIN are classified HEAPTUPLE_RECENTLY_DEAD during the scan

xmax        First as-yet-unassigned txid. All txids greater than or equal to this are not yet started as of the time of the snapshot, and thus invisible.
xip_list        Active txids at the time of the snapshot. The list includes only those active txids between xmin and xmax; there might be active txids higher than xmax. A txid that is xmin <= txid < xmax and not in this list was already completed at the time of the snapshot, and thus either visible or dead according to its commit status. The list does not include txids of subtransactions.
txid_snapshot's textual representation is xmin:xmax:xip_list. For example 10:20:10,14,15 means xmin=10, xmax=20, xip_list=10, 14, 15.
To fix this bloat-prone behavior, we would need to know the oldest open repeatable read or serializable transaction in the database,
and use that transaction ID as the cutoff for HEAPTUPLE_RECENTLY_DEAD.
For example, if the smallest unfinished repeatable read / serializable transaction in database A is Xa, garbage produced in A after Xa cannot be reclaimed; but whether garbage produced after Xa in other databases can be reclaimed depends only on those databases' own smallest unfinished repeatable read / serializable transactions, independent of A.
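A practical aside (my sketch, not part of the original analysis): the sessions holding back the vacuum horizon can be spotted from pg_stat_activity, whose backend_xid / backend_xmin columns exist since 9.4:

postgres=# select pid, state, xact_start, backend_xid, backend_xmin from pg_stat_activity where backend_xmin is not null order by age(backend_xmin) desc;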

[References]
1. http://blog.163.com/digoal@126/b ... 402015329115636287/
2. src/include/storage/proc.h
/*
* Prior to PostgreSQL 9.2, the fields below were stored as part of the
* PGPROC.  However, benchmarking revealed that packing these particular
* members into a separate array as tightly as possible sped up GetSnapshotData
* considerably on systems with many CPU cores, by reducing the number of
* cache lines needing to be fetched.  Thus, think very carefully before adding
* anything else here.
*/
typedef struct PGXACT
{
        TransactionId xid;                      /* id of top-level transaction currently being
                                                                 * executed by this proc, if running and XID
                                                                 * is assigned; else InvalidTransactionId */

        TransactionId xmin;                     /* minimal running XID as it was when we were
                                                                 * starting our xact, excluding LAZY VACUUM:
                                                                 * vacuum must not remove tuples deleted by
                                                                 * xid >= xmin ! */

        uint8           vacuumFlags;    /* vacuum-related flags, see above */
        bool            overflowed;
        bool            delayChkpt;             /* true if this proc delays checkpoint start;
                                                                 * previously called InCommit */

        uint8           nxids;
} PGXACT;

14#
Posted on 2015-06-04 14:24
The Block Range Index: an index access method born for streaming data.
PostgreSQL 9.5 introduces a brand-new index access method, BRIN (block range index). A BRIN index stores, for each range of consecutive table blocks, the range of values those blocks contain.
For example, suppose a table has 1,000 data blocks and we build a BRIN index on its ID column (assuming the table has one).
By default BRIN stores one value range per 128 consecutive data blocks, so the index divides the 1,000 blocks into a few 128-block ranges and records the range of ID values in each.
BRIN is obviously a lossy index: we cannot match the wanted rows exactly from the index alone, but it narrows the scan down to, at best, a single range of 128 consecutive blocks (assuming the wanted value falls in that range).
That is BRIN in a nutshell. Where can it be used?
An excellent scenario is streaming log data, e.g. user-behavior events, bulk-inserted continuously in time order.
To query such data by time we used to build a BTREE index, which supports range and exact lookups but stores much more information; with a large data set, the index gets huge too.
With BRIN the index stays tiny, and because the rows arrive in time order the summaries carry a lot of information: the time range covered by each consecutive block range is nicely independent of the others, with little cross-range overlap. (With heavy overlap, scanning via BRIN would be no better than a full table scan.)
BRIN can be thought of as slicing up a full table scan: when the value order correlates well with the physical storage order, BRIN is an excellent choice.
For the notion of correlation, see a statistics text, or an earlier article of mine:
http://blog.163.com/digoal@126/b ... 040201512810112541/
Next, let's test BRIN on well-correlated and poorly-correlated data and compare performance.
postgres=# create table t1(id int,info text);
CREATE TABLE
postgres=# create table t2(id int,info text);
CREATE TABLE
postgres=# insert into t1 select generate_series(1,10000000),md5(random()::text);
INSERT 0 10000000
The following insert makes the correlation between ID and physical storage order very poor:
postgres=# insert into t2 select id,md5(random()::text) from generate_series(1,10000000) as t(id) order by random();
INSERT 0 10000000
postgres=# analyze t1;
ANALYZE
postgres=# analyze t2;
ANALYZE
Check their correlation: clearly t2's physical order correlates very poorly with the value order.
postgres=# select correlation from pg_stats where tablename='t1' and attname='id';
correlation
-------------
           1
(1 row)
postgres=# select correlation from pg_stats where tablename='t2' and attname='id';
correlation
-------------
  0.00805771
(1 row)
Create the indexes. Building them is noticeably faster than building a BTREE index, because BRIN only stores value ranges: it is lean.
postgres=# create index idx_t1_id on t1 using brin (id);
CREATE INDEX
postgres=# create index idx_t2_id on t2 using brin (id);
CREATE INDEX
Compare index and table sizes. As BRIN's design suggests, the index is tiny: the table is 650MB, the index only 192kB.
postgres=# \di+
                          List of relations
Schema |   Name    | Type  |  Owner   | Table |  Size  | Description
--------+-----------+-------+----------+-------+--------+-------------
public | idx_t1_id | index | postgres | t1    | 192 kB |
public | idx_t2_id | index | postgres | t2    | 192 kB |
(2 rows)
postgres=# \dt+ t1
                    List of relations
Schema | Name | Type  |  Owner   |  Size  | Description
--------+------+-------+----------+--------+-------------
public | t1   | table | postgres | 650 MB |
(1 row)
postgres=# \dt+ t2
                    List of relations
Schema | Name | Type  |  Owner   |  Size  | Description
--------+------+-------+----------+--------+-------------
public | t2   | table | postgres | 650 MB |
(1 row)

The actual queries show just how well BRIN suits streaming data:
postgres=# explain analyze select * from t1 where id>=1000 and id<=5000;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t1  (cost=50.98..9767.60 rows=3803 width=37) (actual time=0.351..13.732 rows=4001 loops=1)
   Recheck Cond: ((id >= 1000) AND (id <= 5000))
   Rows Removed by Index Recheck: 57567
   Heap Blocks: lossy=128
   ->  Bitmap Index Scan on idx_t1_id  (cost=0.00..50.03 rows=3803 width=0) (actual time=0.104..0.104 rows=1280 loops=1)
         Index Cond: ((id >= 1000) AND (id <= 5000))
Planning time: 0.111 ms
Execution time: 14.019 ms
(8 rows)
For the poorly correlated data it is even worse than a full table scan:
postgres=# explain analyze select * from t2 where id>=1000 and id<=5000;
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t2  (cost=49.78..9549.73 rows=3686 width=37) (actual time=2.806..2268.044 rows=4001 loops=1)
   Recheck Cond: ((id >= 1000) AND (id <= 5000))
   Rows Removed by Index Recheck: 9995999
   Heap Blocks: lossy=20791
   ->  Bitmap Index Scan on idx_t2_id  (cost=0.00..48.86 rows=3686 width=0) (actual time=2.019..2.019 rows=208640 loops=1)
         Index Cond: ((id >= 1000) AND (id <= 5000))
Planning time: 0.195 ms
Execution time: 2268.590 ms
(8 rows)
t2 with a full table scan:
postgres=# set enable_bitmapscan=off;
SET
postgres=# explain analyze select * from t2 where id>=1000 and id<=5000;
                                                QUERY PLAN                                                
-----------------------------------------------------------------------------------------------------------
Seq Scan on t2  (cost=0.00..170791.00 rows=3686 width=37) (actual time=0.593..1881.929 rows=4001 loops=1)
   Filter: ((id >= 1000) AND (id <= 5000))
   Rows Removed by Filter: 9995999
Planning time: 0.109 ms
Execution time: 1882.397 ms
(5 rows)
Next, compare BRIN against BTREE indexes.
postgres=# create index idx_t1_id_bt on t1 using btree (id);
CREATE INDEX
postgres=# create index idx_t2_id_bt on t2 using btree (id);
CREATE INDEX
postgres=# set enable_bitmapscan=on;
SET
postgres=# drop index idx_t1_id;
DROP INDEX
postgres=# drop index idx_t2_id;
DROP INDEX
postgres=# explain analyze select * from t1 where id>=1000 and id<=5000;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_t1_id_bt on t1  (cost=0.43..102.04 rows=3880 width=37) (actual time=0.023..1.048 rows=4001 loops=1)
   Index Cond: ((id >= 1000) AND (id <= 5000))
Planning time: 0.412 ms
Execution time: 1.318 ms
(4 rows)

postgres=# explain analyze select * from t2 where id>=1000 and id<=5000;
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t2  (cost=53.05..10056.68 rows=3962 width=37) (actual time=1.932..8.304 rows=4001 loops=1)
   Recheck Cond: ((id >= 1000) AND (id <= 5000))
   Heap Blocks: exact=3642
   ->  Bitmap Index Scan on idx_t2_id_bt  (cost=0.00..52.05 rows=3962 width=0) (actual time=1.143..1.143 rows=4001 loops=1)
         Index Cond: ((id >= 1000) AND (id <= 5000))
Planning time: 0.379 ms
Execution time: 8.621 ms
(7 rows)

The btree indexes do query faster, but look at their size:
postgres=# \di+
                            List of relations
Schema |     Name     | Type  |  Owner   | Table |  Size  | Description
--------+--------------+-------+----------+-------+--------+-------------
public | idx_t1_id_bt | index | postgres | t1    | 213 MB |
public | idx_t2_id_bt | index | postgres | t2    | 213 MB |
(2 rows)
Now raise BRIN's precision to improve query efficiency. The default BRIN summarizes 128 consecutive data blocks per range; the smaller this value, the higher the precision.
postgres=# create index idx_t1_id on t1 using brin (id) with (pages_per_range=1);
CREATE INDEX
postgres=# create index idx_t2_id on t2 using brin (id) with (pages_per_range=1);
CREATE INDEX
postgres=# \di+
                            List of relations
Schema |     Name     | Type  |  Owner   | Table |  Size  | Description
--------+--------------+-------+----------+-------+--------+-------------
public | idx_t1_id    | index | postgres | t1    | 672 kB |
public | idx_t1_id_bt | index | postgres | t1    | 213 MB |
public | idx_t2_id    | index | postgres | t2    | 672 kB |
public | idx_t2_id_bt | index | postgres | t2    | 213 MB |
(4 rows)
postgres=# drop index idx_t1_id_bt;
DROP INDEX
postgres=# drop index idx_t2_id_bt;
DROP INDEX
postgres=# explain analyze select * from t1 where id>=1000 and id<=5000;
                                                       QUERY PLAN                                                      
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t1  (cost=110.98..9827.60 rows=3803 width=37) (actual time=9.487..10.571 rows=4001 loops=1)
   Recheck Cond: ((id >= 1000) AND (id <= 5000))
   Rows Removed by Index Recheck: 328
   Heap Blocks: lossy=9
   ->  Bitmap Index Scan on idx_t1_id  (cost=0.00..110.03 rows=3803 width=0) (actual time=9.449..9.449 rows=90 loops=1)
         Index Cond: ((id >= 1000) AND (id <= 5000))
Planning time: 0.141 ms
Execution time: 10.853 ms
(8 rows)
postgres=# explain analyze select * from t2 where id>=1000 and id<=5000;
                                                         QUERY PLAN                                                         
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t2  (cost=109.78..9609.73 rows=3686 width=37) (actual time=10.407..481.673 rows=4001 loops=1)
   Recheck Cond: ((id >= 1000) AND (id <= 5000))
   Rows Removed by Index Recheck: 2125867  # the cost of low precision: fetching 4001 rows forced a recheck of 2125867 useless rows
   Heap Blocks: lossy=4428
   ->  Bitmap Index Scan on idx_t2_id  (cost=0.00..108.86 rows=3686 width=0) (actual time=10.364..10.364 rows=44280 loops=1)
         Index Cond: ((id >= 1000) AND (id <= 5000))
Planning time: 0.106 ms
Execution time: 482.077 ms
(8 rows)
Raising the precision improves scan efficiency noticeably. (If the column's physical correlation is poor, don't use BRIN at all; even pages_per_range=1 won't rescue it, the recheck waste is simply too large; see the check below.)
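That correlation is easy to inspect: pg_stats exposes it per column (populated after ANALYZE), with values near 1 meaning the heap's physical order tracks the column order:
postgres=# select tablename, attname, correlation
             from pg_stats
            where tablename in ('t1', 't2') and attname = 'id';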
Compared with btree there is still a gap, but for big-data workloads insert performance matters just as much: which index is cheaper to maintain on insert, btree or brin? A quick test (single session, no concurrency involved) already makes it obvious that the btree index adds far more insert overhead.
postgres=# \d t1
      Table "public.t1"
Column |  Type   | Modifiers
--------+---------+-----------
id     | integer |
info   | text    |
Indexes:
    "idx_t1_id" brin (id) WITH (pages_per_range=1)
postgres=# \timing
Timing is on.
postgres=# insert into t1 select generate_series(1,1000000);
INSERT 0 1000000
Time: 2152.527 ms
postgres=# drop index idx_t1_id;
DROP INDEX
Time: 9.527 ms
postgres=# create index idx_t1_id_bt on t1 using btree (id);
CREATE INDEX
Time: 29659.752 ms
postgres=# insert into t1 select generate_series(1,1000000);
INSERT 0 1000000
Time: 5407.971 ms
Finally, pageinspect lets us look inside the BRIN index.
postgres=# create extension pageinspect;
CREATE EXTENSION
postgres=# select * from brin_page_items(get_raw_page('idx_t1_id',10),'idx_t1_id');
itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |        value         
------------+--------+--------+----------+----------+-------------+----------------------
          1 |   2176 |      1 | f        | f        | f           | {1046657 .. 1047137}
          2 |   2177 |      1 | f        | f        | f           | {1047138 .. 1047618}
          3 |   2178 |      1 | f        | f        | f           | {1047619 .. 1048099}
          4 |   2179 |      1 | f        | f        | f           | {1048100 .. 1048580}
          5 |   2180 |      1 | f        | f        | f           | {1048581 .. 1049061}
          6 |   2181 |      1 | f        | f        | f           | {1049062 .. 1049542}
          7 |   2182 |      1 | f        | f        | f           | {1049543 .. 1050023}
...
For example, the entry says heap block 2176 covers the id range {1046657 .. 1047137}; let's verify with ctid:
postgres=# select min(id),max(id) from t1 where ctid::text ~ E'^\\(2176,';
   min   |   max   
---------+---------
1046657 | 1047137
(1 row)
# exactly right
pageinspect offers a few more BRIN functions:
postgres=# SELECT brin_page_type(get_raw_page('idx_t1_id', id)) from generate_series(0,10) t(id);
brin_page_type
----------------
meta
revmap
revmap
revmap
revmap
revmap
regular
regular
regular
regular
regular
(11 rows)
postgres=# SELECT * FROM brin_metapage_info(get_raw_page('idx_t1_id', 0));
   magic    | version | pagesperrange | lastrevmappage
------------+---------+---------------+----------------
0xA8109CFA |       1 |             1 |              5
(1 row)
postgres=# SELECT * FROM brin_revmap_data(get_raw_page('idx_t1_id', 1)) limit 5;
   pages   
-----------
(18,1105)
(18,1106)
(18,1107)
(18,1108)
(18,1109)
(5 rows)
As of now PostgreSQL supports six index access methods: btree, hash, gin, gist, spgist and brin. Pick whichever fits the actual workload.
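They can be listed straight from the catalog:
postgres=# select amname from pg_am order by 1;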

[References]
1. http://blog.163.com/digoal@126/b ... 040201512810112541/
2. http://www.postgresql.org/docs/devel/static/brin.html
3. http://www.postgresql.org/docs/devel/static/sql-createindex.html
BRIN indexes accept a different parameter:

pages_per_range
Defines the number of table blocks that make up one block range for each entry of a BRIN index (see Section 60.1 for more details). The default is 128.

15#
Posted 2015-6-4 14:25
After a split brain in a streaming-replication primary/standby pair, how do you repair it quickly?
PostgreSQL 9.5 brings pg_rewind into the source tree; it is a very useful tool.
When running streaming replication, if the primary has a problem, the standby is promoted to read-write and takes over the service.
But if the old primary keeps taking reads and writes at the same time, the pair ends up in a split brain.
After a split brain, how does the old primary become a standby again?
Usually we rebuild the standby: delete its data files and resync everything.
Another option is rsync, copying only the changed files.
If the data set is large, both approaches take a long time.
How pg_rewind works:
1. First obtain the timeline at which the standby was promoted.
2. On the old primary, locate the last checkpoint before that timeline switch.
3. On the old primary, find all the XLOG it generated after that checkpoint.
4. Parse those XLOG records to get the set of changed blocks.
5. Fetch those blocks from the new primary and overwrite them on the old primary (blocks newly added on the old side are removed).
6. Copy everything except data files (clog, etc.) from the new primary to the old primary.
7. The old primary is now back at the divergence point.
pg_rewind can only take you this far; the steps below are manual.
8. Edit the old primary's configuration files (postgresql.conf, recovery.conf, pg_hba.conf) so it becomes a standby of the new primary.
9. Take special care with restore_command, since XLOG the new primary generated after promotion may already have been archived.
10. Start the old primary and let it recover. (A condensed command sketch, with placeholder paths, follows.)
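In outline, the whole procedure is just a few commands (a sketch only; paths and connection details here are placeholders, not the ones used in the walkthrough below):
$ pg_ctl stop -m fast -D /path/to/old_primary_pgdata        # must be a clean shutdown
$ pg_rewind -D /path/to/old_primary_pgdata \
      --source-server='host=new_primary port=5432 user=postgres' -P
$ # edit recovery.conf / postgresql.conf per steps 8-9, then
$ pg_ctl start -D /path/to/old_primary_pgdata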

Now let's try it out.
Test environment:
Both instances run on the same host, with different cluster_name settings.
cluster_name=db1
cluster_name=db2

db1
$PGDATA    /data02/pgdata95/pg_root
$ARCH         /data02/pgdata95/pg_arch
$PGPORT    1922

db2
$PGDATA    /data03/pgdata95/pg_root
$ARCH         /data03/pgdata95/pg_arch
$PGPORT    1923

Configure the primary
# grep "^[a-z]" postgresql.conf
listen_addresses = '0.0.0.0'            # what IP address(es) to listen on;
port = 1922                             # (change requires restart)
max_connections = 199                   # (change requires restart)
superuser_reserved_connections = 13     # (change requires restart)
unix_socket_directories = '.'   # comma-separated list of directories
unix_socket_permissions = 0700          # begin with 0 to use octal notation
tcp_keepalives_idle = 60                # TCP_KEEPIDLE, in seconds;
tcp_keepalives_interval = 10            # TCP_KEEPINTVL, in seconds;
tcp_keepalives_count = 10               # TCP_KEEPCNT;
shared_buffers = 1024MB                 # min 128kB
huge_pages = try                        # on, off, or try
dynamic_shared_memory_type = posix      # the default is the first option
vacuum_cost_delay = 10                  # 0-100 milliseconds
vacuum_cost_limit = 10000               # 1-10000 credits
bgwriter_delay = 10ms                   # 10-10000ms between rounds
wal_level = logical                     # minimal, archive, hot_standby, or logical
synchronous_commit = off                # synchronization level;
wal_sync_method = open_sync             # the default is the first option
full_page_writes = on                   # recover from partial page writes
wal_compression = on                    # enable compression of full-page writes
wal_log_hints = on
wal_writer_delay = 10ms         # 1-10000 milliseconds
archive_mode = on               # allows archiving to be done
archive_command = 'test ! -f /data02/pgdata95/pg_arch/%f && cp %p /data02/pgdata95/pg_arch/%f'          # command to use to archive a logfile segment
max_wal_senders = 12            # max number of walsender processes
max_replication_slots = 12      # max number of replication slots
track_commit_timestamp = on     # collect timestamp of transaction commit
hot_standby = on                        # "on" allows queries during recovery
wal_receiver_status_interval = 1s       # send replies at least this often
hot_standby_feedback = on               # send info from standby to prevent
log_destination = 'csvlog'              # Valid values are combinations of
logging_collector = on          # Enable capturing of stderr and csvlog
log_directory = 'pg_log'                # directory where log files are written,
log_truncate_on_rotation = on           # If on, an existing log file with the
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose           # terse, default, or verbose messages
log_replication_commands = on
log_timezone = 'PRC'
cluster_name = 'db1'                    # added to process titles if nonempty
autovacuum = on                 # Enable autovacuum subprocess?  'on'
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'                       # locale for system error message
lc_monetary = 'C'                       # locale for monetary formatting
lc_numeric = 'C'                        # locale for number formatting
lc_time = 'C'                           # locale for time formatting
default_text_search_config = 'pg_catalog.english'
pg_hba.conf
# replication privilege.
local   replication     postgres                                trust
host    replication     postgres        127.0.0.1/32            trust
host    replication     postgres        ::1/128                 trust
recovery.conf
pg95@db-172-16-3-150-> cp /opt/pgsql9.5/share/recovery.conf.sample $PGDATA/
pg95@db-172-16-3-150-> cd $PGDATA
pg95@db-172-16-3-150-> mv recovery.conf.sample recovery.done
pg95@db-172-16-3-150-> vi recovery.done
restore_command = 'cp /data03/pgdata95/pg_arch/%f %p'           # e.g. 'cp /mnt/server/archivedir/%f %p'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=127.0.0.1 port=1923 user=postgres keepalives_idle=60'          # e.g. 'host=localhost port=5432'
Archive directory
[root@db-172-16-3-150 postgresql-7320681]# mkdir /data02/pgdata95/pg_arch
[root@db-172-16-3-150 postgresql-7320681]# chown pg95:pg95 /data02/pgdata95/pg_arch
Start the database
pg_ctl restart -m fast

Create the primary-standby pair
[root@db-172-16-3-150 postgresql-7320681]# mkdir -p /data03/pgdata95/pg_root
[root@db-172-16-3-150 postgresql-7320681]# mkdir -p /data03/pgdata95/pg_arch
[root@db-172-16-3-150 postgresql-7320681]# chown -R pg95:pg95 /data03/pgdata95
[root@db-172-16-3-150 postgresql-7320681]# chmod 700 /data03/pgdata95/pg_root

[root@db-172-16-3-150 postgresql-7320681]# su - pg95
pg95@db-172-16-3-150-> pg_basebackup -D /data03/pgdata95/pg_root -F p -U postgres -h 127.0.0.1 -p 1922
WARNING:  skipping special file "./.s.PGSQL.1922"
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived
Edit the configuration files:
pg95@db-172-16-3-150-> cd /data03/pgdata95/pg_root/
pg95@db-172-16-3-150-> mv recovery.done recovery.conf
pg95@db-172-16-3-150-> vi recovery.conf    # change these two lines
restore_command = 'cp /data02/pgdata95/pg_arch/%f %p'           # e.g. 'cp /mnt/server/archivedir/%f %p'
primary_conninfo = 'host=127.0.0.1 port=1922 user=postgres keepalives_idle=60'          # e.g. 'host=localhost port=5432'

pg95@db-172-16-3-150-> vi postgresql.conf
Change:
port = 1923
cluster_name = 'db2'
archive_command = 'test ! -f /data03/pgdata95/pg_arch/%f && cp %p /data03/pgdata95/pg_arch/%f'
Start the standby
pg95@db-172-16-3-150-> pg_ctl start -D /data03/pgdata95/pg_root
server starting
pg95@db-172-16-3-150-> LOG:  00000: redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
LOCATION:  SysLogger_Start, syslogger.c:622

Both clusters are now running:
[root@db-172-16-3-150 postgresql-7320681]# ps -ewf|grep pg95
pg95     27194     1  0 15:49 pts/2    00:00:00 /opt/pgsql9.5/bin/postgres
pg95     27196 27194  0 15:49 ?        00:00:00 postgres: db1: logger process   
pg95     27198 27194  0 15:49 ?        00:00:00 postgres: db1: checkpointer process   
pg95     27199 27194  0 15:49 ?        00:00:00 postgres: db1: writer process   
pg95     27200 27194  0 15:49 ?        00:00:00 postgres: db1: wal writer process   
pg95     27201 27194  0 15:49 ?        00:00:00 postgres: db1: autovacuum launcher process   
pg95     27202 27194  0 15:49 ?        00:00:00 postgres: db1: archiver process   last was 000000010000000000000002
pg95     27203 27194  0 15:49 ?        00:00:00 postgres: db1: stats collector process   
pg95     27349     1  0 15:57 pts/2    00:00:00 /opt/pgsql9.5/bin/postgres -D /data03/pgdata95/pg_root
pg95     27351 27349  0 15:57 ?        00:00:00 postgres: db2: logger process                        
pg95     27352 27349  0 15:57 ?        00:00:00 postgres: db2: startup process   recovering 000000010000000000000003
pg95     27355 27349  0 15:57 ?        00:00:00 postgres: db2: checkpointer process                  
pg95     27356 27349  0 15:57 ?        00:00:00 postgres: db2: writer process                        
pg95     27357 27349  0 15:57 ?        00:00:00 postgres: db2: stats collector process               
pg95     27359 27349  0 15:57 ?        00:00:00 postgres: db2: wal receiver process   streaming 0/3000140
pg95     27360 27194  0 15:57 ?        00:00:00 postgres: db1: wal sender process postgres 127.0.0.1(16326) streaming 0/3000140
root     27401  1264  0 15:58 pts/2    00:00:00 grep pg95

Create some test data and force a checkpoint
pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1922
psql (9.5devel)
Type "help" for help.
postgres=# create table t1(id int, info text);
CREATE TABLE
postgres=# insert into t1 select generate_series(1,1000000), md5(random()::text);
INSERT 0 1000000
postgres=# select sum(hashtext(t1.*::text)) from t1;
      sum      
---------------
-585064225655
(1 row)
postgres=# checkpoint;
CHECKPOINT
Verify the standby is in sync
pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1923
psql (9.5devel)
Type "help" for help.
postgres=# select sum(hashtext(t1.*::text)) from t1;
      sum      
---------------
-585064225655
(1 row)
Promote the standby.
pg95@db-172-16-3-150-> pg_ctl promote -D /data03/pgdata95/pg_root
server promoting

Primary and standby are no longer linked; now simulate the split brain.
On the old primary:
pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1922
psql (9.5devel)
Type "help" for help.
postgres=# insert into t1 select generate_series(1,1000000), md5(random()::text);
INSERT 0 1000000
postgres=# insert into t1 select generate_series(1,5000000), md5(random()::text);
INSERT 0 5000000
postgres=# delete from t1;
DELETE 7000000
postgres=# insert into t1 select generate_series(1,5000000), md5(random()::text);
INSERT 0 5000000
postgres=# select sum(hashtext(t1.*::text)) from t1;
     sum      
--------------
316362957898
(1 row)
postgres=# create table t2 as select * from t1;
SELECT 5000000
postgres=# create index idx_t1 on t1(id);
CREATE INDEX
postgres=# create function ft() returns void as $$
postgres$# declare
postgres$# begin
postgres$# return;
postgres$# end;
postgres$# $$ language plpgsql;
CREATE FUNCTION
postgres=# alter table t1 add column crt_time timestamp;
ALTER TABLE
postgres=# alter table t1 add constraint pk primary key(id);
ALTER TABLE
postgres=# create view vt1 as select * from t1 where id<100;
CREATE VIEW

On the new primary:
pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1923
psql (9.5devel)
Type "help" for help.
postgres=# insert into t1 select generate_series(1,6000000), md5(random()::text);
INSERT 0 6000000
postgres=# delete from t1;
DELETE 7000000
postgres=# insert into t1 select generate_series(1,5000000), md5(random()::text);
INSERT 0 5000000
postgres=# select sum(hashtext(t1.*::text)) from t1;
      sum      
---------------
1868065255940
(1 row)

The goal now is to turn the old primary back into a standby.
Stop the old primary:
pg95@db-172-16-3-150-> pg_ctl stop -m fast -D /data02/pgdata95/pg_root
Dry-run the rewind first (the -n flag), without applying any changes:
pg95@db-172-16-3-150-> pg_rewind -D /data02/pgdata95/pg_root --source-server='hostaddr=127.0.0.1 port=1923 user=postgres dbname=postgres password=postgres' -P --debug -n
connected to remote server
fetched file "global/pg_control", length 8192
fetched file "pg_xlog/00000002.history", length 41
The servers diverged at WAL position 0/8BF34E0 on timeline 1.
could not open file "/data02/pgdata95/pg_root/pg_xlog/000000010000000000000008": No such file or directory

could not find previous WAL record at 0/8BF34E0
Failure, exiting
pg_rewind must see every change the old primary made after the divergence point; those changes are read from pg_xlog, and the affected blocks are then copied over from the new primary.
Because the old primary churned through a lot of data after the split, many xlog segments have already been archived, so they must be copied from the archive back into pg_xlog.
Copy them:
pg95@db-172-16-3-150-> cd /data02/pgdata95/pg_arch/
pg95@db-172-16-3-150-> ll
total 2.4G
-rw------- 1 pg95 pg95 16M Apr  9 15:50 000000010000000000000001
-rw------- 1 pg95 pg95 16M Apr  9 15:50 000000010000000000000002
-rw------- 1 pg95 pg95 302 Apr  9 15:50 000000010000000000000002.00000060.backup
-rw------- 1 pg95 pg95 16M Apr  9 15:59 000000010000000000000003
-rw------- 1 pg95 pg95 16M Apr  9 15:59 000000010000000000000004
-rw------- 1 pg95 pg95 16M Apr  9 15:59 000000010000000000000005
-rw------- 1 pg95 pg95 16M Apr  9 15:59 000000010000000000000006
-rw------- 1 pg95 pg95 16M Apr  9 15:59 000000010000000000000007
-rw------- 1 pg95 pg95 16M Apr  9 16:02 000000010000000000000008
-rw------- 1 pg95 pg95 16M Apr  9 16:02 000000010000000000000009
.....

pg95@db-172-16-3-150-> cp * /data02/pgdata95/pg_root/pg_xlog/

Try again:
pg95@db-172-16-3-150-> pg_rewind -D /data02/pgdata95/pg_root --source-server='hostaddr=127.0.0.1 port=1923 user=postgres dbname=postgres password=postgres' -P -n
connected to remote server
The servers diverged at WAL position 0/8BF34E0 on timeline 1.
Rewinding from last common checkpoint at 0/8BF3438 on timeline 1
reading source file list
reading target file list
reading WAL in target
Need to copy 1755 MB (total source directory size is 1786 MB)
1797713/1797713 kB (100%) copied
creating backup label and updating control file
Done!

If the dry run is clean, drop the -n flag and run it for real:
pg95@db-172-16-3-150-> pg_rewind -D /data02/pgdata95/pg_root --source-server='hostaddr=127.0.0.1 port=1923 user=postgres dbname=postgres password=postgres' -P
connected to remote server
The servers diverged at WAL position 0/8BF34E0 on timeline 1.
Rewinding from last common checkpoint at 0/8BF3438 on timeline 1
reading source file list
reading target file list
reading WAL in target
Need to copy 1755 MB (total source directory size is 1786 MB)
1797714/1797714 kB (100%) copied
creating backup label and updating control file
Done!

Once it finishes, the old primary has to be started manually.
Before starting, adjust the configuration:
recovery.conf
pg95@db-172-16-3-150-> cd /data02/pgdata95/pg_root/
pg95@db-172-16-3-150-> ll
total 148K
-rw------- 1 pg95 pg95  175 Apr  9 16:21 backup_label
-rw------- 1 pg95 pg95  206 Apr  9 16:21 backup_label.old
drwx------ 5 pg95 pg95 4.0K Apr  9 16:21 base
drwx------ 2 pg95 pg95 4.0K Apr  9 16:21 global
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_clog
drwx------ 2 pg95 pg95 4.0K Apr  9 11:09 pg_commit_ts
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_dynshmem
-rw------- 1 pg95 pg95 4.4K Apr  9 16:21 pg_hba.conf
-rw------- 1 pg95 pg95 1.6K Apr  9 16:21 pg_ident.conf
drwx------ 2 pg95 pg95 4.0K Apr  9 16:21 pg_log
drwx------ 4 pg95 pg95 4.0K Apr  9 09:04 pg_logical
drwx------ 4 pg95 pg95 4.0K Apr  9 09:04 pg_multixact
drwx------ 2 pg95 pg95 4.0K Apr  9 16:14 pg_notify
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_replslot
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_serial
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_snapshots
drwx------ 2 pg95 pg95 4.0K Apr  9 16:21 pg_stat
drwx------ 2 pg95 pg95 4.0K Apr  9 16:21 pg_stat_tmp
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_subtrans
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_tblspc
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_twophase
-rw------- 1 pg95 pg95    4 Apr  9 09:04 PG_VERSION
drwx------ 3 pg95 pg95  16K Apr  9 16:21 pg_xlog
-rw------- 1 pg95 pg95   88 Apr  9 16:21 postgresql.auto.conf
-rw------- 1 pg95 pg95  22K Apr  9 16:21 postgresql.conf
-rw------- 1 pg95 pg95   27 Apr  9 16:14 postmaster.opts
-rw-r--r-- 1 pg95 pg95 5.7K Apr  9 16:21 recovery.done
pg95@db-172-16-3-150-> mv recovery.done recovery.conf
pg95@db-172-16-3-150-> vi recovery.conf    # change these two lines
restore_command = 'cp /data03/pgdata95/pg_arch/%f %p'           # e.g. 'cp /mnt/server/archivedir/%f %p'
primary_conninfo = 'host=127.0.0.1 port=1923 user=postgres keepalives_idle=60'          # e.g. 'host=localhost port=5432'

pg95@db-172-16-3-150-> vi postgresql.conf
Change:
port = 1922
cluster_name = 'db1'
archive_command = 'test ! -f /data02/pgdata95/pg_arch/%f && cp %p /data02/pgdata95/pg_arch/%f'

Start the old primary; it comes up as a standby.
Check that the split brain is resolved and the node matches the new primary exactly; for instance, the constraint, function and view created during the split should all be gone.
pg95@db-172-16-3-150-> pg_ctl start -D /data02/pgdata95/pg_root
server starting
pg95@db-172-16-3-150-> LOG:  00000: redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
LOCATION:  SysLogger_Start, syslogger.c:622

pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1922
psql (9.5devel)
Type "help" for help.

postgres=# \dt
        List of relations
Schema | Name | Type  |  Owner   
--------+------+-------+----------
public | t1   | table | postgres
public | tbl  | table | postgres
public | test | table | postgres
(3 rows)

postgres=# \dv
No relations found.
postgres=# \d+ t1
                          Table "public.t1"
Column |  Type   | Modifiers | Storage  | Stats target | Description
--------+---------+-----------+----------+--------------+-------------
id     | integer |           | plain    |              |
info   | text    |           | extended |              |
postgres=# \df ft
                       List of functions
Schema | Name | Result data type | Argument data types | Type
--------+------+------------------+---------------------+------
(0 rows)
postgres=# select sum(hashtext(t1.*::text)) from t1;
      sum      
---------------
1868065255940   -- matches the new primary
(1 row)

Next, check whether the table added on the old primary during the split is still there; size tells the story.
pg95@db-172-16-3-150-> psql
psql (9.5devel)
Type "help" for help.

postgres=# select oid from pg_database where datname='postgres';
  oid  
-------
13253
(1 row)

postgres=# select pg_relation_filepath('t1');
pg_relation_filepath
----------------------
base/13253/16451
(1 row)


This is table t1:
pg95@db-172-16-3-150-> cd /data02/pgdata95/pg_root/base/13253/
pg95@db-172-16-3-150-> ll 16451*
-rw------- 1 pg95 pg95 780M Apr  9 16:31 16451
-rw------- 1 pg95 pg95 128K Apr  9 16:21 16451_fsm
-rw------- 1 pg95 pg95  32K Apr  9 16:31 16451_vm
Apart from t1, the t2 created on the old primary during the split is clearly gone; note the sizes.
For a more rigorous check, record t2's pg_relation_filepath on the old primary before running pg_rewind, then come back and see whether the file still exists.
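That pre-rewind bookkeeping is a one-liner:
postgres=# select pg_relation_filepath('t2');    -- run on the old primary, before pg_rewind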
pg95@db-172-16-3-150-> ll -S | head -n 10
total 796M
-rw------- 1 pg95 pg95 780M Apr  9 16:31 16451
-rw------- 1 pg95 pg95 576K Apr  9 16:21 1255
-rw------- 1 pg95 pg95 480K Apr  9 16:21 2618
-rw------- 1 pg95 pg95 448K Apr  9 16:31 2608
-rw------- 1 pg95 pg95 448K Apr  9 16:31 2674
-rw------- 1 pg95 pg95 416K Apr  9 16:31 2673
-rw------- 1 pg95 pg95 384K Apr  9 16:31 1249
-rw------- 1 pg95 pg95 288K Apr  9 12:27 2609
-rw------- 1 pg95 pg95 256K Apr  9 16:21 2691

This completes the recovery.

[Notes]
Preconditions for using pg_rewind:
1. wal_log_hints must be enabled, or the cluster must have been initialized with data checksums (a quick check follows this list).
Otherwise it fails:
pg95@db-172-16-3-150-> pg_rewind -D /data02/pgdata95/pg_root --source-server='hostaddr=127.0.0.1 port=1923 user=postgres dbname=postgres password=postgres' -P --debug -n
connected to remote server
fetched file "global/pg_control", length 8192
target server need to use either data checksums or "wal_log_hints = on"
Failure, exiting
2. full_page_writes must be enabled.
3. While pg_rewind runs, all xlog the old primary produced after the last checkpoint preceding the failover must sit in its pg_xlog directory.
pg_rewind will not invoke restore_command to fetch archived segments for you.
4. The old primary must have been shut down cleanly before pg_rewind is run.
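A quick way to verify preconditions 1 and 2 (a sketch; the exact pg_controldata wording can vary by version):
postgres=# show wal_log_hints;
postgres=# show full_page_writes;
$ pg_controldata $PGDATA | grep -i checksum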

[References]
1. http://www.postgresql.org/docs/devel/static/app-pgrewind.html
2. http://www.postgresql.org/docs/devel/static/libpq-envars.html

16#
Posted 2015-6-4 14:34

Small-transaction, high-concurrency benchmark of PostgreSQL vs. Oracle - the Oracle side
Oracle is 12c. Hardware and software match the PostgreSQL test: same host, with any interfering services and software shut down during the run.
Tune the Oracle parameters, create the test table, and generate the test data:
SQL> create tablespace tbs_digoal datafile '/data01/oradata/tbs_digoal' size 30G autoextend off;
SQL> create tablespace tbs_digoal_idx datafile '/data02/oradata/tbs_digoal_idx' size 30G autoextend off;

SQL> create user digoal identified by digoal default tablespace tbs_digoal;
SQL> grant resource,connect,dba to digoal;
SQL> grant unlimited tablespace to digoal;

SQL> alter system set processes=2000 scope=spfile;
SQL> alter system set SGA_TARGET=24G scope=spfile;
SQL> alter system set PGA_AGGREGATE_TARGET=1G scope=spfile;
SQL> alter system set db_keep_cache_size=10G scope=spfile;
SQL> alter system set CURSOR_SHARING=force scope=spfile;
SQL> alter system set commit_logging='batch' scope=spfile;  
SQL> alter system set commit_write='nowait' scope=spfile;

SQL> shutdown immediate
SQL> startup

SQL> conn digoal/digoal
SQL> create table tbl (id int,info varchar2(256),crt_time date) tablespace tbs_digoal;

Insert the test data:
declare   
id int := 1;
begin   
loop   
exit when id>50000000;
insert into digoal.tbl nologging select id,sysdate,sysdate from dual;
id := id+1;
end loop;
commit;
end;
/

-- or

declare  
maxrecords constant int := 50000000;
begin
for id in 1..maxrecords loop
insert into digoal.tbl nologging select id,sysdate,sysdate from dual;
end loop;
dbms_output.put_line('data loaded successfully!');
commit;
end;
/

-- or
SQL> insert into digoal.tbl nologging select rownum,sysdate,sysdate from dual connect by level <=50000000;

Create the primary key and pin the data in memory.
SQL> alter table digoal.tbl add constraint tbl_pkey primary key(id) using index tablespace tbs_digoal_idx;
SQL> alter table digoal.tbl storage (buffer_pool KEEP);
SQL> alter index digoal.tbl_pkey storage (buffer_pool KEEP);

Edit the profile
# vi /etc/profile
export ORACLE_BASE=/opt/oracle/product
export ORACLE_HOME=$ORACLE_BASE/12.0.2/db_1
export LD_LIBRARY_PATH=$ORACLE_HOME/lib:$LD_LIBRARY_PATH
export PATH=$ORACLE_HOME/bin:$PATH

# . /etc/profile

Install the Python driver for Oracle
# wget https://pypi.python.org/packages ... 9c20087ec812c7092ab
# tar -zxvf cx_Oracle-5.1.3.tar.gz
# cd cx_Oracle-5.1.3
# python setup.py install

Python test script:
# vi test.py
import threading
import time
import cx_Oracle
import random

dsn=cx_Oracle.makedsn('127.0.0.1', 1521, 'oradb')

xs=12000
tps=dict()

class n_t(threading.Thread):   # worker thread: one connection, loops on the prepared statement
  def __init__(self, num):
    threading.Thread.__init__(self)
    self.thread_num = num

  def run(self): # override run(); the thread's workload goes here
    db=cx_Oracle.connect('digoal','digoal',dsn)
    db.autocommit=1
    cur=db.cursor()
    pre=cur.prepare('update /*SQL' + str(self.thread_num) + '*/ tbl set info=sysdate,crt_time=sysdate where id=:id')

    tps[self.thread_num] = dict()

    f = open("/tmp/oracle_test." + str(self.thread_num), "w")

    for x in range(1,3001):
      start_t = time.time()
      for i in range(0,xs):
        cur.execute(pre,{'id':random.randrange(1,50000000)})   
      stop_t = time.time()
      tps[self.thread_num][x] = round(xs/(stop_t-start_t),2)
      res = "Round: " + str(x) + " TID: " + str(self.thread_num) + " Sec: " + str(round((stop_t-start_t),2)) + " tps: " + str(tps[self.thread_num][x])
      print >> f, res
      f.flush()

    f.close()

def test():
  t_names = []
  for i in xrange(0,26):
    t_names.append(n_t(i))

  for t in t_names:
    t.start()
  
  return

if __name__ == '__main__':
  test()

UPDATE test (26 connections):
# nohup python ./test.py >/dev/null 2>&1 &
Results; each thread logs a tps figure roughly every 10 seconds:
# less /tmp/oracle_test.22
Round: 1 TID: 22 Sec: 11.46 tps: 1187.12
Round: 2 TID: 22 Sec: 11.34 tps: 1198.65
.....
Results from the other threads omitted.

Merge the per-thread results:
# vi tps.py
file = dict()
tps = 0

for i in xrange(0,26):
  file[i] = open("/tmp/oracle_test." + str(i))

while 1:
    done = False
    for i in xrange(0,26):
      line = file[i].readline()
      if not line:               # this thread's log is exhausted
        done = True
        break
      # field 7 of "Round: r TID: t Sec: s tps: v" is the tps value
      tps = tps + float(line.split(" ",8)[7])

    if done:
      break

    print(str(tps))
    tps = 0

for i in xrange(0,26):
  file[i].close()

# python tps.py
33792.57
29093.92
33370.67
33768.3

33750.5
33458.86
30580.32
33369.79
......
[chart: merged UPDATE TPS over time]
Oracle UPDATE TPS concentrates between 32000 and 33000, quite smooth.

SELECT test (36 connections):
Test script:
# vi test.py
import threading
import time
import cx_Oracle
import random

dsn=cx_Oracle.makedsn('127.0.0.1', 1521, 'oradb')

xs=12000
tps=dict()

class n_t(threading.Thread):   # worker thread: one connection, loops on the prepared statement
  def __init__(self, num):
    threading.Thread.__init__(self)
    self.thread_num = num

  def run(self): # override run(); the thread's workload goes here
    db=cx_Oracle.connect('digoal','digoal',dsn)
    db.autocommit=1
    cur=db.cursor()
    pre=cur.prepare('select /*SQL' + str(self.thread_num) + '*/ * from tbl where id=:id')

    tps[self.thread_num] = dict()

    f = open("/tmp/oracle_test." + str(self.thread_num), "w")

    for x in range(1,3001):
      start_t = time.time()
      for i in range(0,xs):
        cur.execute(pre,{'id':random.randrange(1,50000000)})   
      stop_t = time.time()
      tps[self.thread_num][x] = round(xs/(stop_t-start_t),2)
      res = "Round: " + str(x) + " TID: " + str(self.thread_num) + " Sec: " + str(round((stop_t-start_t),2)) + " tps: " + str(tps[self.thread_num][x])
      print >> f, res
      f.flush()

    f.close()

def test():
  t_names = []
  for i in xrange(0,36):
    t_names.append(n_t(i))

  for t in t_names:
    t.start()
  
  return

if __name__ == '__main__':
  test()
Results:
# less oracle_test.0
Round: 1 TID: 0 Sec: 13.37 tps: 897.85
Round: 2 TID: 0 Sec: 11.72 tps: 1023.83
Round: 3 TID: 0 Sec: 11.89 tps: 1009.26
Round: 4 TID: 0 Sec: 11.82 tps: 1015.2
Round: 5 TID: 0 Sec: 11.84 tps: 1013.37
Round: 6 TID: 0 Sec: 11.91 tps: 1007.87
Round: 7 TID: 0 Sec: 11.97 tps: 1002.33
Round: 8 TID: 0 Sec: 11.96 tps: 1003.0
Round: 9 TID: 0 Sec: 11.92 tps: 1007.11
Round: 10 TID: 0 Sec: 11.75 tps: 1020.97
Round: 11 TID: 0 Sec: 11.75 tps: 1021.17
Round: 12 TID: 0 Sec: 11.95 tps: 1004.07
Round: 13 TID: 0 Sec: 11.81 tps: 1015.96
......
Merged results:
36637.71
36436.34
36512.47
36614.69
36585.26
36553.55
36578.89
36540.29
36483.53
36594.29
36438.53
36518.64
36480.38
......
[chart: merged SELECT TPS over time]
INSERT test (20 connections):
SQL> truncate table digoal.tbl;
SQL> create sequence digoal.seq cache 12000;
SQL> alter table digoal.tbl modify crt_time default sysdate;
SQL> alter table digoal.tbl modify info default sysdate;
SQL> alter table digoal.tbl storage (buffer_pool default);
SQL> alter index digoal.tbl_pkey storage (buffer_pool default);
Test script:
# vi test.py
import threading
import time
import cx_Oracle
import random

dsn=cx_Oracle.makedsn('127.0.0.1', 1521, 'oradb')

xs=12000
tps=dict()

class n_t(threading.Thread):   # worker thread: one connection, loops on the prepared statement
  def __init__(self, num):
    threading.Thread.__init__(self)
    self.thread_num = num

  def run(self): # override run(); the thread's workload goes here
    db=cx_Oracle.connect('digoal','digoal',dsn)
    db.autocommit=1
    cur=db.cursor()
    pre=cur.prepare('insert /*SQL' + str(self.thread_num) +'*/ into tbl (id) values(seq.nextval)')

    tps[self.thread_num] = dict()

    f = open("/tmp/oracle_test." + str(self.thread_num), "w")

    for x in range(1,3001):
      start_t = time.time()
      for i in range(0,xs):
        cur.execute(pre)   
      stop_t = time.time()
      tps[self.thread_num][x] = round(xs/(stop_t-start_t),2)
      res = "Round: " + str(x) + " TID: " + str(self.thread_num) + " Sec: " + str(round((stop_t-start_t),2)) + " tps: " + str(tps[self.thread_num][x])
      print >> f, res
      f.flush()

    f.close()

def test():
  t_names = []
  for i in xrange(0,20):
    t_names.append(n_t(i))

  for t in t_names:
    t.start()
  
  return

if __name__ == '__main__':
  test()
Results:
# less oracle_test.19
Round: 1 TID: 19 Sec: 22.98 tps: 522.25
Round: 2 TID: 19 Sec: 27.12 tps: 442.54
Round: 3 TID: 19 Sec: 23.8 tps: 504.2
Round: 4 TID: 19 Sec: 26.47 tps: 453.35
Round: 5 TID: 19 Sec: 23.63 tps: 507.83
Round: 6 TID: 19 Sec: 23.78 tps: 504.72
Round: 7 TID: 19 Sec: 26.02 tps: 461.13
Round: 8 TID: 19 Sec: 23.9 tps: 502.02
Round: 9 TID: 19 Sec: 26.22 tps: 457.67
Round: 10 TID: 19 Sec: 23.48 tps: 511.14
Round: 11 TID: 19 Sec: 26.89 tps: 446.31
Round: 12 TID: 19 Sec: 22.46 tps: 534.37
......
Merged results:
10195.18
8863.84
10067.91
9167.06
9753.85
10024.51
9028.92
10006.42
8869.49
10251.39
9202.53
10048.2
10062.34
......
[chart: merged INSERT TPS over time]
Oracle has to write UNDO first and then redo for inserts, so it is comparatively slow.

[Misc]
1. Initially the test script failed with ORA-24550 KPEDBG_HDL_PUSH_FCPTRMAX.
Fix:
# cd /opt/oracle/product/12.0.2/db_1/network/admin/
# cp samples/sqlnet.ora ./
# vi sqlnet.ora
Append:
DIAG_ADR_ENABLED=FALSE
DIAG_SIGHANDLER_ENABLED=FALSE
DIAG_DDE_ENABLED=FALSE
2. The test ran into heavy CURSOR: pin S wait events; rewriting the SQL with a per-thread comment resolved it.
For example:
    pre=cur.prepare('insert /*SQL' + str(self.thread_num) +'*/ into tbl (id) values(seq.nextval)')
[References]
1. http://blog.163.com/digoal@126/b ... 040201541104656600/

17#
Posted 2015-6-4 14:36
A new way to do row-level security in PostgreSQL.

PostgreSQL 9.5 adds a very powerful security feature: fine-grained control over which rows a user can see and which it can write.
This helps isolate what each of the many users sharing a table can view and touch.

How it works:
Create policies bound to a table and to roles; different roles can be given different rules for SELECT, INSERT, UPDATE and DELETE on the table's rows.
CREATE POLICY name ON table_name
    [ FOR { ALL | SELECT | INSERT | UPDATE | DELETE } ]
    [ TO { role_name | PUBLIC } [, ...] ]
    [ USING ( using_expression ) ]
    [ WITH CHECK ( check_expression ) ]
USING validates rows that already exist; it applies to SELECT, UPDATE, DELETE and ALL.
WITH CHECK validates rows about to be written; it applies to INSERT, UPDATE and ALL.
Note that UPDATE touches both old and new rows, so if a policy has only USING and no WITH CHECK, the USING expression is also applied as the WITH CHECK condition.
If several policies exist for the same command, they are OR-ed: the command passes if any one of them is TRUE.
For example, with one ALL policy and one SELECT policy for role r1, a SELECT passes if either is TRUE; likewise if r1 has several SELECT policies, any one being TRUE lets the SELECT through.

Test:
Fetch a development snapshot from git:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=summary
[root@db-172-16-3-150 soft_bak]# tar -zxvf postgresql-7320681.tar.gz
[root@db-172-16-3-150 soft_bak]# cd postgresql-7320681
[root@db-172-16-3-150 postgresql-7320681]# ./configure --prefix=/opt/pgsql9.5 --with-pgport=1922 --with-perl --with-python --with-tcl --with-openssl --with-pam --with-ldap --with-libxml --with-libxslt --enable-thread-safety --with-blocksize=32 --enable-debug
[root@db-172-16-3-150 postgresql-7320681]# gmake world && gmake install-world

[root@db-172-16-3-150 postgresql-7320681]# useradd pg95
[root@db-172-16-3-150 ~]# su - pg95
[pg95@db-172-16-3-150 ~]$ vi .bash_profile
# add by digoal
export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=1922
export PGDATA=/data02/pgdata95/pg_root
export LANG=en_US.utf8
export PGHOME=/opt/pgsql9.5
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=$PGDATA
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi

[root@db-172-16-3-150 ~]# mkdir /data02/pgdata95
[root@db-172-16-3-150 ~]# chown pg95:pg95 /data02/pgdata95

[root@db-172-16-3-150 ~]# su - pg95
pg95@db-172-16-3-150-> initdb -D $PGDATA -E UTF8 --locale=C -U postgres -W

Create three roles
postgres=# create role r1 login;
CREATE ROLE
postgres=# create role r2 login;
CREATE ROLE
postgres=# create role r3 login;
CREATE ROLE
Create a test table
postgres=# create table test(id int, r name);
CREATE TABLE
postgres=# insert into test values(1, 'r1');
INSERT 0 1
postgres=# insert into test values(2, 'r2');
INSERT 0 1
postgres=# insert into test values(3, 'r3');
INSERT 0 1
postgres=# grant all on table test to public;
GRANT
Create a policy for new rows (using WITH CHECK)
postgres=# create policy p on test for insert to r1 with check( r = current_user);
CREATE POLICY
By default the policy is disabled:
postgres=# \d+ test
                         Table "public.test"
Column |  Type   | Modifiers | Storage | Stats target | Description
--------+---------+-----------+---------+--------------+-------------
id     | integer |           | plain   |              |
r      | name    |           | plain   |              |
Policies (Row Security Disabled):
    POLICY "p" FOR INSERT
      TO r1
      WITH CHECK (r = "current_user"())
The pg_policies view shows the policies created so far.
postgres=# select * from pg_policies ;
schemaname | tablename | policyname | roles |  cmd   | qual |       with_check      
------------+-----------+------------+-------+--------+------+------------------------
public     | test      | p          | {r1}  | INSERT |      | (r = "current_user"())
(1 row)
Before row security is enabled, policies are ignored:
postgres=> insert into test values(4,'r1');
INSERT 0 1
postgres=> insert into test values(4,'r2');
INSERT 0 1
Enable the policy:
postgres=# alter table test enable row level security;
ALTER TABLE
postgres=> \d+ test
                         Table "public.test"
Column |  Type   | Modifiers | Storage | Stats target | Description
--------+---------+-----------+---------+--------------+-------------
id     | integer |           | plain   |              |
r      | name    |           | plain   |              |
Policies:
    POLICY "p" FOR INSERT
      TO r1
      WITH CHECK (r = "current_user"())
With the policy in force, an insert now only succeeds when r matches role r1's own name.
postgres=# \c postgres r1
You are now connected to database "postgres" as user "r1".
postgres=> insert into test values(4,'r2');
ERROR:  new row violates WITH CHECK OPTION for "test"
postgres=> insert into test values(4,'r1');
INSERT 0 1
Add another policy; role r1 may now insert rows whose r value is 'r1' or 'r2'.
postgres=# create policy p1 on test for insert to r1 with check( r = 'r2');
CREATE POLICY
postgres=# \c postgres r1
You are now connected to database "postgres" as user "r1".
postgres=> insert into test values(4,'r2');
INSERT 0 1
postgres=> insert into test values(4,'r1');
INSERT 0 1
postgres=> insert into test values(4,'r3');
ERROR:  new row violates WITH CHECK OPTION for "test"

Next, a policy over existing rows: user r1 can only see rows where r = current_user.
postgres=> \c postgres postgres
You are now connected to database "postgres" as user "postgres".
postgres=# create policy p2 on test for select to r1 using ( r = current_user);
CREATE POLICY
postgres=# \c postgres r1
You are now connected to database "postgres" as user "r1".
postgres=> select * from test;
id | r  
----+----
  1 | r1
  4 | r1
  4 | r1
  4 | r1
(4 rows)
Naturally, we can also create a policy for all users, e.g. everyone sees only rows where r = current_user.
postgres=> \c postgres postgres
You are now connected to database "postgres" as user "postgres".
postgres=# create policy p3 on test for select to public using ( r = current_user);
CREATE POLICY
postgres=# \c postgres r2
You are now connected to database "postgres" as user "r2".
postgres=> select * from test;
id | r  
----+----
  2 | r2
  4 | r2
  4 | r2
(3 rows)

Of course, all these policies apply only to users who are neither superuser nor the table owner.

postgres=> \c postgres postgres
You are now connected to database "postgres" as user "postgres".
postgres=# select * from test;
id | r  
----+----
  1 | r1
  2 | r2
  3 | r3
  4 | r1
  4 | r2
  4 | r1
  4 | r2
  4 | r1
(8 rows)
Turn r1 into a superuser and the policies no longer apply.
postgres=# alter role r1 superuser;
ALTER ROLE
postgres=# \c postgres r1
You are now connected to database "postgres" as user "r1".
postgres=# select * from test;
id | r  
----+----
  1 | r1
  2 | r2
  3 | r3
  4 | r1
  4 | r2
  4 | r1
  4 | r2
  4 | r1
(8 rows)

For UPDATE, rows are read before they are written, so the USING check runs first and the WITH CHECK second. If only USING is supplied, the WITH CHECK step still happens, reusing the USING expression. If only WITH CHECK is supplied, reads go unchecked but writes are checked. (A small sketch follows.)
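A minimal sketch of that behavior (the policy name p_upd is hypothetical; the error is the same one seen in the INSERT test above):
postgres=# create policy p_upd on test for update to r2 using (r = current_user);
-- no WITH CHECK was given, so the USING expression also vets the new row; as r2:
postgres=> update test set r = 'r3' where r = 'r2';
ERROR:  new row violates WITH CHECK OPTION for "test"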

One more thing: once a role has any policy on a table, it needs policies for every command it runs, because uncovered commands fall back to default deny.
src/backend/rewrite/rowsecurity.c
if row-level security is enabled on the table and there
is no policy which applies, then a default-deny policy will be used.

For example, suppose only one UPDATE policy exists:
postgres=# \d test
     Table "public.test"
Column |  Type   | Modifiers
--------+---------+-----------
id     | integer |
r      | name    |
Policies:
    POLICY "p4" FOR UPDATE
      TO r3
      USING (r = "current_user"())
postgres=# \c postgres r3
You are now connected to database "postgres" as user "r3".
An UPDATE policy exists for role r3 but nothing for the other commands, so every other command hits the default-deny policy:
postgres=> select * from test;
id | r
----+---
(0 rows)
The update itself is allowed by its policy:
postgres=> update test set id=4 where r='r3';
UPDATE 1
postgres=> select * from test;
id | r
----+---
(0 rows)
Create a SELECT policy and queries work again:
postgres=# create policy p1 on test for select to r3 using ( r = current_user);
CREATE POLICY
postgres=# \d+ test
                         Table "public.test"
Column |  Type   | Modifiers | Storage | Stats target | Description
--------+---------+-----------+---------+--------------+-------------
id     | integer |           | plain   |              |
r      | name    |           | plain   |              |
Policies:
    POLICY "p1" FOR SELECT
      TO r3
      USING (r = "current_user"())
    POLICY "p4" FOR UPDATE
      TO r3
      USING (r = "current_user"())
postgres=# \c postgres r3
You are now connected to database "postgres" as user "r3".
postgres=> select * from test;
id | r  
----+----
  4 | r3
(1 row)
There is still no DELETE policy, so deletes are flatly denied:
postgres=> delete from test ;
DELETE 0
Role r1 has no policies at all (and, as seen above, is now a superuser), so its operations are all allowed:
postgres=> \c postgres r1
You are now connected to database "postgres" as user "r1".
postgres=# select * from test;
id | r  
----+----
  1 | r1
  2 | r2
  4 | r1
  4 | r2
  4 | r1
  4 | r2
  4 | r1
  4 | r3
(8 rows)

When exporting with pg_dump or importing with pg_restore, if a table has row security enabled, an ordinary user's pg_dump or pg_restore will fail.
For example:
pg95@db-172-16-3-150-> pg_dump -U r2 -t test postgres
.....
COPY test (id, r) FROM stdin;
pg_dump: [archiver (db)] query failed: ERROR:  insufficient privilege to bypass row security.
pg_dump: [archiver (db)] query was: COPY public.test (id, r) TO stdout;

Finally, the order in which policy filters apply. Remember the view attack?
http://blog.163.com/digoal@126/b ... 7040201431410032638
In the same way, a LEAKPROOF function is evaluated ahead of (together with) the policy filter.
Example:
An ordinary function:
postgres=> \c postgres postgres
You are now connected to database "postgres" as user "postgres".
postgres=# create or replace function attack(int,name) returns boolean as $$
postgres$# declare
postgres$# begin
postgres$#   raise notice '%,%', $1,$2;
postgres$#   return true;
postgres$# end;
postgres$# $$ language plpgsql cost 0.00000000000000000000001 ;
CREATE FUNCTION
postgres=# \c postgres r3
You are now connected to database "postgres" as user "r3".
postgres=> select * from test where attack(id,r);
NOTICE:  4,r3
id | r  
----+----
  4 | r3
(1 row)
An ordinary function in the WHERE clause is evaluated after the policy filter:
postgres=> explain select * from test where attack(id,r);
                             QUERY PLAN                             
--------------------------------------------------------------------
Subquery Scan on test  (cost=0.00..61.32 rows=6 width=68)
   Filter: attack(test.id, test.r)
   ->  Seq Scan on test test_1  (cost=0.00..61.15 rows=17 width=68)
         Filter: (r = "current_user"())
(4 rows)
But mark the function LEAKPROOF,
postgres=> \c postgres postgres
You are now connected to database "postgres" as user "postgres".
postgres=# alter function attack(int,name) LEAKPROOF;
ALTER FUNCTION
and it is promoted to run right alongside the policy filter:
postgres=# \c postgres r3
postgres=> explain select * from test where attack(id,r);
                      QUERY PLAN                     
------------------------------------------------------
Seq Scan on test  (cost=0.00..61.15 rows=6 width=68)
   Filter: (attack(id, r) AND (r = "current_user"()))
(2 rows)
so the function can read out every row:
postgres=> select * from test where attack(id,r);
NOTICE:  1,r1
NOTICE:  2,r2
NOTICE:  4,r1
NOTICE:  4,r2
NOTICE:  4,r1
NOTICE:  4,r2
NOTICE:  4,r1
NOTICE:  4,r3
id | r  
----+----
  4 | r3
(1 row)

[References]
1. http://www.postgresql.org/docs/devel/static/sql-createpolicy.html
2. http://www.postgresql.org/docs/d ... IG-CLIENT-STATEMENT
row_security (enum)
This variable controls if row security policies are to be applied to queries which are run against tables that have row security enabled. The default is 'on'. When set to 'on', all users, except superusers and the owner of the table, will have the row policies for the table applied to their queries. The table owner and superuser can request that row policies be applied to their queries by setting this to 'force'. Lastly, this can also be set to 'off' which will bypass row policies for the table, if possible, and error if not.

For a user who is not a superuser and not the table owner to bypass row policies for the table, they must have the BYPASSRLS role attribute. If this is set to 'off' and the user queries a table which has row policies enabled and the user does not have the right to bypass row policies then a permission denied error will be returned.

The allowed values of row_security are on (apply normally - not to superuser or table owner), off (fail if row security would be applied), and force (apply always - even to superuser and table owner).

For more information on row security policies, see CREATE POLICY.
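For instance, a table owner who wants the policies applied to his own session can request it (per the 9.5-devel semantics quoted above):
postgres=# set row_security = force;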

18#
Posted 2015-6-4 14:43
Why does throughput drop so badly behind pgpool-II?
To find where pgpool-II spends its time, we trace it with stap.
pgpool-II compiles with debug info by default, so a plain build suffices; no Makefile edits are needed.
[root@db-172-16-3-150 pgpool-II-3.4.1]# less Makefile
Debugging is on by default, as shown:
CFLAGS = -g -O2 -Wall -Wmissing-prototypes -Wmissing-declarations

# man gcc
   Options for Debugging Your Program or GCC
       GCC has various special options that are used for debugging either your program or GCC:

       -g  Produce debugging information in the operating system’s native format (stabs, COFF, XCOFF, or DWARF 2).
           GDB can work with this debugging information.

           On most systems that use stabs format, -g enables use of extra debugging information that only GDB can use;
           this extra information makes debugging work better in GDB but will probably make other debuggers crash or
           refuse to read the program.  If you want to control for certain whether to generate the extra information,
           use -gstabs+, -gstabs, -gxcoff+, -gxcoff, or -gvms (see below).

           GCC allows you to use -g with -O.  The shortcuts taken by optimized code may occasionally produce
           surprising results: some variables you declared may not exist at all; flow of control may briefly move
           where you did not expect it; some statements may not be executed because they compute constant results or
           their values were already at hand; some statements may execute in different places because they were moved
           out of loops.

           Nevertheless it proves possible to debug optimized output.  This makes it reasonable to use the optimizer
           for programs that might have bugs.

           The following options are useful when GCC is generated with the capability for more than one debugging
           format.
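With -g in the build, a systemtap probe can be attached to the running pgpool processes; a minimal sketch (the binary path matches this install; filtering and aggregation are left out):
# stap -e 'probe process("/opt/pgpool3.4.1/bin/pgpool").function("*") { printf("%d %s\n", pid(), probefunc()) }'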
The test uses three servers:
1. db
8-core Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx tm2 cx16 xtpr lahf_lm

2. pgbench
8-core Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx tm2 cx16 xtpr lahf_lm

3. pgpool
8-core Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm dts tpr_shadow vnmi flexpriority ept vpid

Install pgpool-II
[root@db-172-16-3-150 src]# cd /opt/soft_bak/
[root@db-172-16-3-150 soft_bak]# wget http://www.pgpool.net/download.php?f=pgpool-II-3.4.1.tar.gz

[root@db-172-16-3-150 soft_bak]# tar -zxvf pgpool-II-3.4.1.tar.gz
[root@db-172-16-3-150 soft_bak]# cd pgpool-II-3.4.1

[root@db-172-16-3-150 pgpool-II-3.4.1]# ./configure --prefix=/opt/pgpool3.4.1 --with-pgsql=/opt/pgsql
[root@db-172-16-3-150 pgpool-II-3.4.1]# gmake && gmake install
[root@db-172-16-3-150 pgpool-II-3.4.1]# vi /etc/ld.so.conf
/opt/pgpool3.4.1/lib

[root@db-172-16-3-150 pgpool-II-3.4.1]# ldconfig
[root@db-172-16-3-150 pgpool-II-3.4.1]# ldconfig -p|grep pgpool
        libpcp.so.0 (libc6,x86-64) => /opt/pgpool3.4.1/lib/libpcp.so.0
        libpcp.so (libc6,x86-64) => /opt/pgpool3.4.1/lib/libpcp.so

[root@db-172-16-3-150 pgpool-II-3.4.1]# vi /etc/profile
export PATH=/opt/pgpool3.4.1/bin:$PATH

The sample configuration files:
[root@db-172-16-3-150 pgpool-II-3.4.1]# cd src/sample/
[root@db-172-16-3-150 sample]# ll
total 176K
-rw-rw-r-- 1 postgres postgres 2.0K Feb  5 18:28 dist_def_pgbench.sql
-rw-rw-r-- 1 postgres postgres  858 Feb  5 18:28 pcp.conf.sample
-rw-rw-r-- 1 postgres postgres  33K Feb  5 18:33 pgpool.conf.sample
-rw-rw-r-- 1 postgres postgres  33K Feb  5 18:33 pgpool.conf.sample-master-slave
-rw-rw-r-- 1 postgres postgres  33K Feb  5 18:33 pgpool.conf.sample-replication
-rw-rw-r-- 1 postgres postgres  33K Feb  5 18:33 pgpool.conf.sample-stream
-rw-rw-r-- 1 postgres postgres   71 Feb  5 18:28 pgpool.pam
-rw-rw-r-- 1 postgres postgres  695 Feb  5 18:28 pgpool_recovery
-rw-rw-r-- 1 postgres postgres  656 Feb  5 18:28 pgpool_recovery_pitr
-rw-rw-r-- 1 postgres postgres  239 Feb  5 18:28 pgpool_remote_start
-rw-rw-r-- 1 postgres postgres 3.2K Feb  5 18:28 pool_hba.conf.sample
-rw-rw-r-- 1 postgres postgres  420 Feb  5 18:28 replicate_def_pgbench.sql

The pgpool startup command; three configuration files are required.

[root@db-172-16-3-150 sample]# pgpool --help
pgpool-II version 3.4.1 (tataraboshi),
  A generic connection pool/replication/load balance server for PostgreSQL

Usage:
  pgpool [ -c] [ -f CONFIG_FILE ] [ -F PCP_CONFIG_FILE ] [ -a HBA_CONFIG_FILE ]
         [ -n ] [ -D ] [ -d ]
  pgpool [ -f CONFIG_FILE ] [ -F PCP_CONFIG_FILE ] [ -a HBA_CONFIG_FILE ]
         [ -m SHUTDOWN-MODE ] stop
  pgpool [ -f CONFIG_FILE ] [ -F PCP_CONFIG_FILE ] [ -a HBA_CONFIG_FILE ] reload

Common options:
  -a, --hba-file=HBA_CONFIG_FILE
                      Sets the path to the pool_hba.conf configuration file
                      (default: /opt/pgpool3.4.1/etc/pool_hba.conf)
  -f, --config-file=CONFIG_FILE
                      Sets the path to the pgpool.conf configuration file
                      (default: /opt/pgpool3.4.1/etc/pgpool.conf)
  -F, --pcp-file=PCP_CONFIG_FILE
                      Sets the path to the pcp.conf configuration file
                      (default: /opt/pgpool3.4.1/etc/pcp.conf)
  -h, --help          Prints this help

Start options:
  -C, --clear-oidmaps Clears query cache oidmaps when memqcache_method is memcached
                      (If shmem, discards whenever pgpool starts.)
  -n, --dont-detach   Don't run in daemon mode, does not detach control tty
  -x, --debug-assertions   Turns on various assertion checks, This is a debugging aid
  -D, --discard-status Discard pgpool_status file and do not restore previous status
  -d, --debug         Debug mode

Stop options:
  -m, --mode=SHUTDOWN-MODE
                      Can be "smart", "fast", or "immediate"

Shutdown modes are:
  smart       quit after all clients have disconnected
  fast        quit directly, with proper shutdown
  immediate   the same mode as fast

Copy the samples into place:
[root@db-172-16-3-150 sample]# pwd
/opt/soft_bak/pgpool-II-3.4.1/src/sample
[root@db-172-16-3-150 sample]# mkdir /etc/pgpool
[root@db-172-16-3-150 sample]# cp pcp.conf.sample /etc/pgpool/pcp.conf
[root@db-172-16-3-150 sample]# cp pgpool.conf.sample /etc/pgpool/pgpool.conf
[root@db-172-16-3-150 sample]# cp pool_hba.conf.sample /etc/pgpool/pool_hba.conf
[root@db-172-16-3-150 sample]# chown -R postgres:postgres /etc/pgpool

Edit the three configuration files; for easy testing we use trust authentication.
[root@db-172-16-3-150 pgpool]# which pg_md5
/opt/pgpool3.4.1/bin/pg_md5
[root@db-172-16-3-150 pgpool]# pg_md5 -p
password:    (type a password here, e.g. digoal)
ebe067295ca3f3b8040d0f4d955d2d72

[root@db-172-16-3-150 pgpool]# cd /etc/pgpool/
[root@db-172-16-3-150 pgpool]# vi pcp.conf
postgres:ebe067295ca3f3b8040d0f4d955d2d72

[root@db-172-16-3-150 pgpool]# vi pool_hba.conf
# "local" is for Unix domain socket connections only
local   all         all                               trust
# IPv4 local connections:
host    all         all         127.0.0.1/32          trust
host    all         all         ::1/128               trust
host all all 0.0.0.0/0 trust

We use only connection-pool mode here; every other feature is off.
[root@db-172-16-3-150 pgpool]# vi pgpool.conf
# ----------------------------
# pgPool-II configuration file
# ----------------------------

# - pgpool Connection Settings -

listen_addresses = '0.0.0.0'
                                   # Host name or IP address to listen on:
                                   # '*' for all, '' for no TCP/IP connections
                                   # (change requires restart)
port = 9999
                                   # Port number
                                   # (change requires restart)
socket_dir = '/tmp'
                                   # Unix domain socket path
                                   # The Debian package defaults to
                                   # /var/run/postgresql
                                   # (change requires restart)
listen_backlog_multiplier = 2
                                   # Set the backlog parameter of listen(2) to
                                                                   # num_init_children * listen_backlog_multiplier.
                                   # (change requires restart)

# - pgpool Communication Manager Connection Settings -

pcp_listen_addresses = '*'
                                   # Host name or IP address for pcp process to listen on:
                                   # '*' for all, '' for no TCP/IP connections
                                   # (change requires restart)
pcp_port = 9898
                                   # Port number for pcp
                                   # (change requires restart)
pcp_socket_dir = '/tmp'
                                   # Unix domain socket path for pcp
                                   # The Debian package defaults to
                                   # /var/run/postgresql
                                   # (change requires restart)

# - Backend Connection Settings -

backend_hostname0 = '172.16.3.176'
                                   # Host name or IP address to connect to for backend 0
backend_port0 = 1921
                                   # Port number for backend 0
backend_weight0 = 1
                                   # Weight for backend 0 (only in load balancing mode)
backend_data_directory0 = '/data02/pgdata/pg_root'
                                   # Data directory for backend 0
backend_flag0 = 'ALLOW_TO_FAILOVER'
                                   # Controls various backend behavior
                                   # ALLOW_TO_FAILOVER or DISALLOW_TO_FAILOVER


# - Authentication -

enable_pool_hba = off
                                   # Use pool_hba.conf for client authentication
pool_passwd = 'pool_passwd'
                                   # File name of pool_passwd for md5 authentication.
                                   # "" disables pool_passwd.
                                   # (change requires restart)
authentication_timeout = 60
                                   # Delay in seconds to complete client authentication
                                   # 0 means no timeout.

# - SSL Connections -

ssl = off
                                   # Enable SSL support
                                   # (change requires restart)
#ssl_key = './server.key'
                                   # Path to the SSL private key file
                                   # (change requires restart)
#ssl_cert = './server.cert'
                                   # Path to the SSL public certificate file
                                   # (change requires restart)
#ssl_ca_cert = ''
                                   # Path to a single PEM format file
                                   # containing CA root certificate(s)
                                   # (change requires restart)
#ssl_ca_cert_dir = ''
                                   # Directory containing CA root certificate(s)
                                   # (change requires restart)


#------------------------------------------------------------------------------
# POOLS
#------------------------------------------------------------------------------

# - Pool size -

num_init_children = 32
                                   # Number of pools
                                   # (change requires restart)
max_pool = 4
                                   # Number of connections per pool
                                   # (change requires restart)

# - Life time -

child_life_time = 300
                                   # Pool exits after being idle for this many seconds
child_max_connections = 0
                                   # Pool exits after receiving that many connections
                                   # 0 means no exit
connection_life_time = 0
                                   # Connection to backend closes after being idle for this many seconds
                                   # 0 means no close
client_idle_limit = 0
                                   # Client is disconnected after being idle for that many seconds
                                   # (even inside an explicit transactions!)
                                   # 0 means no disconnection


#------------------------------------------------------------------------------
# LOGS
#------------------------------------------------------------------------------

# - Where to log -

log_destination = 'syslog'
                                   # Where to log
                                   # Valid values are combinations of stderr,
                                   # and syslog. Default to stderr.



#------------------------------------------------------------------------------
# FILE LOCATIONS
#------------------------------------------------------------------------------

pid_file_name = '/var/run/pgpool/pgpool.pid'
                                   # PID file name
                                   # (change requires restart)
logdir = '/var/log/pgpool'
                                   # Directory of pgPool status file
                                   # (change requires restart)


#------------------------------------------------------------------------------
# CONNECTION POOLING
#------------------------------------------------------------------------------

connection_cache = on
                                   # Activate connection pools
                                   # (change requires restart)

                                   # Semicolon separated list of queries
                                   # to be issued at the end of a session
                                   # The default is for 8.3 and later
reset_query_list = 'ABORT; DISCARD ALL'
                                   # The following one is for 8.2 and before
#reset_query_list = 'ABORT; RESET ALL; SET SESSION AUTHORIZATION DEFAULT'


#------------------------------------------------------------------------------
# REPLICATION MODE
#------------------------------------------------------------------------------

replication_mode = off
                                   # Activate replication mode
                                   # (change requires restart)



#------------------------------------------------------------------------------
# LOAD BALANCING MODE
#------------------------------------------------------------------------------

load_balance_mode = off
                                   # Activate load balancing mode


#------------------------------------------------------------------------------
# MASTER/SLAVE MODE
#------------------------------------------------------------------------------

master_slave_mode = off


#------------------------------------------------------------------------------
# HEALTH CHECK
#------------------------------------------------------------------------------

health_check_period = 0
                                   # Health check period
                                   # Disabled (0) by default


#------------------------------------------------------------------------------
# ONLINE RECOVERY
#------------------------------------------------------------------------------

...


#------------------------------------------------------------------------------
# WATCHDOG
#------------------------------------------------------------------------------

# - Enabling -

use_watchdog = off
                                    # Activates watchdog
...


#------------------------------------------------------------------------------
# OTHERS
#------------------------------------------------------------------------------
...

#------------------------------------------------------------------------------
# IN MEMORY QUERY MEMORY CACHE
#------------------------------------------------------------------------------
...

Create the log directory and the pid directory:
[root@db-172-16-3-150 pgpool]# mkdir /var/log/pgpool
[root@db-172-16-3-150 pgpool]# mkdir /var/run/pgpool

Start pgpool:
[root@db-172-16-3-150 pgpool]# pgpool -a /etc/pgpool/pool_hba.conf -f /etc/pgpool/pgpool.conf -F /etc/pgpool/pcp.conf

[root@db-172-16-3-150 pgpool]# tail -f -n 10 /var/log/messages
Apr  8 08:40:51 db-172-16-3-150 pgpool[1780]: [1-1] 2015-04-08 08:40:51: pid 1780: LOG:  Backend status file /var/log/pgpool/pgpool_status does not exist
Apr  8 08:40:51 db-172-16-3-150 pgpool[1780]: [1-2] 2015-04-08 08:40:51: pid 1780: LOCATION:  pgpool_main.c:2755
Apr  8 08:40:51 db-172-16-3-150 pgpool[1780]: [2-1] 2015-04-08 08:40:51: pid 1780: LOG:  Setting up socket for 0.0.0.0:9999
Apr  8 08:40:51 db-172-16-3-150 pgpool[1780]: [2-2] 2015-04-08 08:40:51: pid 1780: LOCATION:  pgpool_main.c:797
Apr  8 08:40:51 db-172-16-3-150 pgpool[1780]: [3-1] 2015-04-08 08:40:51: pid 1780: LOG:  pgpool-II successfully started. version 3.4.1 (tataraboshi)
Apr  8 08:40:51 db-172-16-3-150 pgpool[1780]: [3-2] 2015-04-08 08:40:51: pid 1780: LOCATION:  pgpool_main.c:330

Verify that we can connect normally:
postgres@db-172-16-3-150-> psql -h 127.0.0.1 -p 9999 -U postgres postgres
psql (9.4.1)
Type "help" for help.
postgres=# \q

Prepare the script for the stress test. Since the target is pgpool-II rather than the database itself, use the simplest possible query:
postgres@db-172-16-3-174-> vi test.sql
select 1;

On the database host, connecting directly to the database, the results are as follows:
postgres@176-> pgbench -M prepared -n -r -f ./test.sql -h 127.0.0.1 -p 1921 -U postgres -c 16 -j 8 -T 120 postgres
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 8
duration: 120 s
number of transactions actually processed: 13798268
tps = 114984.315254 (including connections establishing)
tps = 114995.257173 (excluding connections establishing)
statement latencies in milliseconds:
        0.137798        select 1;
On the test machine, connecting directly to the database, the results are as follows:
postgres@db-172-16-3-174-> pgbench -M prepared -n -r -f ./test.sql -h 172.16.3.176 -p 1921 -U postgres -c 16 -j 8 -T 120 postgres
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 8
duration: 120 s
number of transactions actually processed: 7534770
tps = 62788.712417 (including connections establishing)
tps = 62794.058850 (excluding connections establishing)
statement latencies in milliseconds:
        0.252501        select 1;
On the test machine, connecting through pgpool-II, the results are as follows:
postgres@db-172-16-3-174-> pgbench -M prepared -n -r -f ./test.sql -h 172.16.3.150 -p 9999 -U postgres -c 16 -j 8 -T 120 postgres
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 8
duration: 120 s
number of transactions actually processed: 2065233
tps = 17209.981713 (including connections establishing)
tps = 17211.768999 (excluding connections establishing)
statement latencies in milliseconds:
        0.765831        select 1;
On the pgpool-II host, connecting directly to the database, the results are as follows:
postgres@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -h 172.16.3.176 -p 1921 -U postgres -c 16 -j 8 -T 120 postgres
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 8
duration: 120 s
number of transactions actually processed: 8632236
latency average: 0.222 ms
tps = 71934.282729 (including connections establishing)
tps = 71943.727533 (excluding connections establishing)
statement latencies in milliseconds:
        0.220962        select 1;

On the pgpool-II host, connecting to pgpool-II locally, the results are as follows:
postgres@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -h 127.0.0.1 -p 9999 -U postgres -c 16 -j 8 -T 120 postgres
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 8
duration: 120 s
number of transactions actually processed: 2033971
latency average: 0.944 ms
tps = 16949.161299 (including connections establishing)
tps = 16951.540647 (excluding connections establishing)
statement latencies in milliseconds:
        0.771560        select 1;

Judging from the test data, the performance overhead of pgpool-II 3.4.1 is still there.
Next, use SystemTap (stap) to trace where the time goes.
Install systemtap:
[root@db-172-16-3-150 src]# uname -r
2.6.32-504.el6.x86_64
yum install -y kernel-devel-2.6.32-504.el6.x86_64 systemtap
Write the tracing script:
# vi trc.stp
global f_start[999999],f_stop[999999]

probe process("/opt/pgpool3.4.1/bin/pgpool").function("*@/opt/soft_bak/pgpool-II-3.4.1/src/*").call {
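  # Record the entry timestamp of this function call, keyed by (execname, pid, tid, cpu).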
  f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ns()
}

probe process("/opt/pgpool3.4.1/bin/pgpool").function("*@/opt/soft_bak/pgpool-II-3.4.1/src/*").return {
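  # On return, compute the elapsed nanoseconds and fold them into a statistic keyed by (execname, probe point).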
  t=gettimeofday_ns()
  a=execname()
  b=cpu()
  c=pid()
  d=pp()
  e=tid()
  if (f_start[a,c,e,b]) {
    f_stop[a,d] <<< t - f_start[a,c,e,b]
  }
  
}

probe timer.s(5) {
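  # After 5 seconds, print the top 50 probe points by total elapsed time, then exit.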
  foreach ([a,d] in f_stop @sum - limit 50 ) {
    printf("avg_ns:%d, sum_ns:%d, cnt:%d, execname:%s, pp:%s\n", @avg(f_stop[a,d]), @sum(f_stop[a,d]), @count(f_stop[a,d]), a, d)
  }
  exit()
}

On the pgbench test machine, start pgbench against pgpool-II:
postgres@db6-> pgbench -M prepared -n -r -f ./test.sql -h 172.16.3.150 -p 9999 -U postgres -c 16 -j 8 -T 50 postgres
Client 5 aborted in state 0. Probably the backend died while processing.
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 8
duration: 50 s
number of transactions actually processed: 784156
tps = 15682.483918 (including connections establishing)
tps = 15686.457333 (excluding connections establishing)
statement latencies in milliseconds:
        0.808510        select 1;

Start stap at the same time, making sure the stress-test window is captured. The report from 5 seconds of collected data is as follows:
[root@db-172-16-3-150 ~]# stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp

Pass 1: parsed user script and 116 library script(s) using 212504virt/40068res/3172shr/37724data kb, in 330usr/30sys/358real ms.
Pass 2: analyzed script: 937 probe(s), 7 function(s), 4 embed(s), 2 global(s) using 218240virt/46796res/4184shr/43460data kb, in 160usr/80sys/328real ms.
Pass 3: translated to C into "/tmp/stapooq1sT/stap_cfcc5355227f07104ee2359b8481c4a1_394533_src.c" using 218244virt/47344res/4532shr/43464data kb, in 100usr/80sys/169real ms.
Pass 4: compiled C into "stap_cfcc5355227f07104ee2359b8481c4a1_394533.ko" in 2590usr/310sys/2868real ms.
Pass 5: starting run.
avg_ns:5266, sum_ns:1909464893, cnt:362594, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
avg_ns:19393, sum_ns:255681973, cnt:13184, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("appendStringInfoVA@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:114").return
avg_ns:5178, sum_ns:204855738, cnt:39558, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("initStringInfo@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:47").return
avg_ns:14494, sum_ns:191096928, cnt:13184, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("pg_vsnprintf@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:202").return
avg_ns:10092, sum_ns:133073987, cnt:13186, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("makeStringInfo@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:29").return
avg_ns:9907, sum_ns:130620770, cnt:13184, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("dopr@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:355").return
avg_ns:5339, sum_ns:35205838, cnt:6593, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("appendStringInfoChar@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:174").return
WARNING: Number of errors: 0, skipped probes: 20377
Pass 5: run completed in 10usr/12590sys/22383real ms.

Some probes were skipped because of lock timeouts. From the remaining results, most of the time is spent in the following function; it has the highest call count and the largest cumulative time.
dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c
static void
dopr_outch(int c, PrintfTarget *target)
{
        if (target->bufend != NULL && target->bufptr >= target->bufend)
        {
                /* buffer full, can we dump to stream? */
                if (target->stream == NULL)
                        return;                         /* no, lose the data */
                flushbuffer(target);
        }
        *(target->bufptr++) = c;
}
To be continued.

#19 | Posted 2015-06-04 14:44
Why does throughput drop so severely with pgpool-II? (continued)
Next I emptied this function's body, recompiled pgpool-II, and tested again. The result was essentially unchanged:
postgres@db6-> pgbench -M prepared -n -r -f ./test.sql -h 172.16.3.150 -p 9999 -U postgres -c 16 -j 8 -T 50 postgres
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 8
duration: 50 s
number of transactions actually processed: 844017
tps = 16879.764062 (including connections establishing)
tps = 16884.295568 (excluding connections establishing)
statement latencies in milliseconds:
        0.781028        select 1;

Next, trace all functions for a single SQL statement.
postgres@db-172-16-3-150-> psql -h 127.0.0.1 -p 9999 -U postgres postgres
psql (9.4.1, server 9.3.5)
Type "help" for help.
postgres=#
Locate the pid of the pgpool child handling this session, for stap to trace:
ps -ewf|grep pgpool
root     21962 21961  0 13:36 ?        00:00:00 pgpool: postgres postgres 127.0.0.1(44044) idle
Modify the tracing script slightly:
[root@db-172-16-3-150 ~]# vi trc.stp
global f_start[999999],f_stop[999999]

probe process("/opt/pgpool3.4.1/bin/pgpool").function("*@/opt/soft_bak/pgpool-II-3.4.1/src/*").call {
  f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ns()
}

probe process("/opt/pgpool3.4.1/bin/pgpool").function("*@/opt/soft_bak/pgpool-II-3.4.1/src/*").return {
  t=gettimeofday_ns()
  a=execname()
  b=cpu()
  c=pid()
  d=pp()
  e=tid()
  if (f_start[a,c,e,b]) {
    f_stop[a,d] <<< t - f_start[a,c,e,b]
    printf("%d, %s\n", t - f_start[a,c,e,b], d)
    # or you can printf("%s \n", $$locals$$)
  }

}

probe end {
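  # When stap exits (e.g. on Ctrl-C), print the top 50 probe points by total elapsed time.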
  foreach ([a,d] in f_stop @sum - limit 50 ) {
    printf("avg_ns:%d, sum_ns:%d, cnt:%d, execname:%s, pp:%s\n", @avg(f_stop[a,d]), @sum(f_stop[a,d]), @count(f_stop[a,d]), a, d)
  }
  exit()
}

Execute one query while the trace is running:
postgres=# select * from pg_class limit 1;
   relname    | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallv
isible | reltoastrelid | reltoastidxid | relhasindex | relisshared | relpersistence | relkind | relnatts | relchecks | relhasoids |
relhaspkey | relhasrules | relhastriggers | relhassubclass | relispopulated | relfrozenxid | relminmxid |           relacl         
  | reloptions
--------------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+--------
-------+---------------+---------------+-------------+-------------+----------------+---------+----------+-----------+------------+-
-----------+-------------+----------------+----------------+----------------+--------------+------------+---------------------------
--+------------
pg_statistic |           11 |   10818 |         0 |       10 |     0 |       12629 |             0 |       15 |       388 |        
    15 |          2840 |             0 | t           | f           | p              | r       |       26 |         0 | f          |
f          | f           | f              | f              | t              |         1800 |          1 | {postgres=arwdDxt/postgres
} |
(1 row)

Time: 2.566 ms

Trace results:
[root@db-172-16-3-150 ~]# stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x 21962
Pass 1: parsed user script and 116 library script(s) using 212508virt/40068res/3172shr/37728data kb, in 320usr/30sys/355real ms.
Pass 2: analyzed script: 937 probe(s), 7 function(s), 4 embed(s), 2 global(s) using 218660virt/47212res/4168shr/43880data kb, in 190usr/80sys/345real ms.
Pass 3: translated to C into "/tmp/stapX1vB5l/stap_3d39e5129acf59add49b6bf0e6dbd2cb_420727_src.c" using 218668virt/47752res/4512shr/43888data kb, in 100usr/70sys/174real ms.
Pass 4: compiled C into "stap_3d39e5129acf59add49b6bf0e6dbd2cb_420727.ko" in 2570usr/350sys/2866real ms.
Pass 5: starting run.
5768, process("/opt/pgpool3.4.1/bin/pgpool").function("core_yylex_init@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:3279").return
4786, process("/opt/pgpool3.4.1/bin/pgpool").function("core_yyensure_buffer_stack@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:2945").return
10659, process("/opt/pgpool3.4.1/bin/pgpool").function("core_yy_switch_to_buffer@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:2740").return
16177, process("/opt/pgpool3.4.1/bin/pgpool").function("core_yy_scan_buffer@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:2995").return
22334, process("/opt/pgpool3.4.1/bin/pgpool").function("scanner_init@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.l:1070").return
4711, process("/opt/pgpool3.4.1/bin/pgpool").function("parser_init@/opt/soft_bak/pgpool-II-3.4.1/src/parser/gram.y:14022").return
6561, process("/opt/pgpool3.4.1/bin/pgpool").function("ScanKeywordLookup@/opt/soft_bak/pgpool-II-3.4.1/src/parser/kwlookup.c:38").return
12141, process("/opt/pgpool3.4.1/bin/pgpool").function("core_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:1325").return
17366, process("/opt/pgpool3.4.1/bin/pgpool").function("base_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/parser.c:108").return
4026, process("/opt/pgpool3.4.1/bin/pgpool").function("core_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:1325").return
8694, process("/opt/pgpool3.4.1/bin/pgpool").function("base_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/parser.c:108").return
4399, process("/opt/pgpool3.4.1/bin/pgpool").function("new_list@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:68").return
9963, process("/opt/pgpool3.4.1/bin/pgpool").function("lcons@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:264").return
3421, process("/opt/pgpool3.4.1/bin/pgpool").function("new_list@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:68").return
8066, process("/opt/pgpool3.4.1/bin/pgpool").function("lcons@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:264").return
4211, process("/opt/pgpool3.4.1/bin/pgpool").function("ScanKeywordLookup@/opt/soft_bak/pgpool-II-3.4.1/src/parser/kwlookup.c:38").return
8851, process("/opt/pgpool3.4.1/bin/pgpool").function("core_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:1325").return
13218, process("/opt/pgpool3.4.1/bin/pgpool").function("base_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/parser.c:108").return
4509, process("/opt/pgpool3.4.1/bin/pgpool").function("ScanKeywordLookup@/opt/soft_bak/pgpool-II-3.4.1/src/parser/kwlookup.c:38").return
4511, process("/opt/pgpool3.4.1/bin/pgpool").function("downcase_truncate_identifier@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scansup.c:131").return
10093, process("/opt/pgpool3.4.1/bin/pgpool").function("core_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:1325").return
14532, process("/opt/pgpool3.4.1/bin/pgpool").function("base_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/parser.c:108").return
4516, process("/opt/pgpool3.4.1/bin/pgpool").function("ScanKeywordLookup@/opt/soft_bak/pgpool-II-3.4.1/src/parser/kwlookup.c:38").return
9213, process("/opt/pgpool3.4.1/bin/pgpool").function("core_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:1325").return
13574, process("/opt/pgpool3.4.1/bin/pgpool").function("base_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/parser.c:108").return
4585, process("/opt/pgpool3.4.1/bin/pgpool").function("makeRangeVar@/opt/soft_bak/pgpool-II-3.4.1/src/parser/makefuncs.c:421").return
3432, process("/opt/pgpool3.4.1/bin/pgpool").function("new_list@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:68").return
7994, process("/opt/pgpool3.4.1/bin/pgpool").function("lcons@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:264").return
6101, process("/opt/pgpool3.4.1/bin/pgpool").function("process_integer_literal@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.l:1177").return
10490, process("/opt/pgpool3.4.1/bin/pgpool").function("core_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:1325").return
15001, process("/opt/pgpool3.4.1/bin/pgpool").function("base_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/parser.c:108").return
4647, process("/opt/pgpool3.4.1/bin/pgpool").function("makeIntConst@/opt/soft_bak/pgpool-II-3.4.1/src/parser/gram.y:13370").return
4495, process("/opt/pgpool3.4.1/bin/pgpool").function("core_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:1325").return
9066, process("/opt/pgpool3.4.1/bin/pgpool").function("base_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/parser.c:108").return
3499, process("/opt/pgpool3.4.1/bin/pgpool").function("new_list@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:68").return
8153, process("/opt/pgpool3.4.1/bin/pgpool").function("lcons@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:264").return
3497, process("/opt/pgpool3.4.1/bin/pgpool").function("lcons@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:264").return
4836, process("/opt/pgpool3.4.1/bin/pgpool").function("list_nth@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:415").return
3732, process("/opt/pgpool3.4.1/bin/pgpool").function("list_nth@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:415").return
4430, process("/opt/pgpool3.4.1/bin/pgpool").function("list_concat@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:326").return
9653, process("/opt/pgpool3.4.1/bin/pgpool").function("insertSelectOptions@/opt/soft_bak/pgpool-II-3.4.1/src/parser/gram.y:13588").return
3565, process("/opt/pgpool3.4.1/bin/pgpool").function("new_list@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:68").return
8187, process("/opt/pgpool3.4.1/bin/pgpool").function("lcons@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:264").return
3923, process("/opt/pgpool3.4.1/bin/pgpool").function("core_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:1325").return
8453, process("/opt/pgpool3.4.1/bin/pgpool").function("base_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/parser.c:108").return
14031, process("/opt/pgpool3.4.1/bin/pgpool").function("base_yyparse@/opt/soft_bak/pgpool-II-3.4.1/src/parser/gram.c:20532").return
4695, process("/opt/pgpool3.4.1/bin/pgpool").function("scanner_finish@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.l:1108").return
10165, process("/opt/pgpool3.4.1/bin/pgpool").function("raw_parser@/opt/soft_bak/pgpool-II-3.4.1/src/parser/parser.c:49").return
6052, process("/opt/pgpool3.4.1/bin/pgpool").function("initStringInfo@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:47").return
15669, process("/opt/pgpool3.4.1/bin/pgpool").function("makeStringInfo@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:29").return
4255, process("/opt/pgpool3.4.1/bin/pgpool").function("initStringInfo@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:47").return
13428, process("/opt/pgpool3.4.1/bin/pgpool").function("makeStringInfo@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:29").return
4565, process("/opt/pgpool3.4.1/bin/pgpool").function("appendStringInfoChar@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:174").return
3760, process("/opt/pgpool3.4.1/bin/pgpool").function("initStringInfo@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:47").return
3342, process("/opt/pgpool3.4.1/bin/pgpool").function("initStringInfo@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:47").return
4465, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3308, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3455, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3250, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3340, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3286, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3327, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3306, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3298, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3282, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3337, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3260, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3352, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3295, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3308, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3284, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3306, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3278, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3330, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3286, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3291, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3290, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3304, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3284, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
7861, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:355").return
12794, process("/opt/pgpool3.4.1/bin/pgpool").function("pg_vsnprintf@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:202").return
17946, process("/opt/pgpool3.4.1/bin/pgpool").function("appendStringInfoVA@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:114").return
3430, process("/opt/pgpool3.4.1/bin/pgpool").function("initStringInfo@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:47").return
3286, process("/opt/pgpool3.4.1/bin/pgpool").function("initStringInfo@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:47").return
3300, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3314, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3316, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3324, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3280, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3328, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3286, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3292, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3257, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3352, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3282, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3306, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3283, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3330, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3286, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3288, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3292, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3294, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3278, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3300, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3276, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3328, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3289, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3268, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3347, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3296, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3280, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3301, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3276, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3329, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
3286, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
7872, process("/opt/pgpool3.4.1/bin/pgpool").function("dopr@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:355").return
12239, process("/opt/pgpool3.4.1/bin/pgpool").function("pg_vsnprintf@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:202").return
16750, process("/opt/pgpool3.4.1/bin/pgpool").function("appendStringInfoVA@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:114").return


^C

avg_ns:3323, sum_ns:182786, cnt:55, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("dopr_outch@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:1024").return
avg_ns:12488, sum_ns:99904, cnt:8, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("base_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/parser.c:108").return
avg_ns:7904, sum_ns:63232, cnt:8, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("core_yylex@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:1325").return
avg_ns:7643, sum_ns:45860, cnt:6, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("lcons@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:264").return
avg_ns:17348, sum_ns:34696, cnt:2, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("appendStringInfoVA@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:114").return
avg_ns:14548, sum_ns:29097, cnt:2, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("makeStringInfo@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:29").return
avg_ns:12516, sum_ns:25033, cnt:2, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("pg_vsnprintf@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:202").return
avg_ns:4020, sum_ns:24125, cnt:6, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("initStringInfo@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:47").return
avg_ns:22334, sum_ns:22334, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("scanner_init@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.l:1070").return
avg_ns:4949, sum_ns:19797, cnt:4, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("ScanKeywordLookup@/opt/soft_bak/pgpool-II-3.4.1/src/parser/kwlookup.c:38").return
avg_ns:3663, sum_ns:18316, cnt:5, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("new_list@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:68").return
avg_ns:16177, sum_ns:16177, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("core_yy_scan_buffer@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:2995").return
avg_ns:7866, sum_ns:15733, cnt:2, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("dopr@/opt/soft_bak/pgpool-II-3.4.1/src/parser/snprintf.c:355").return
avg_ns:14031, sum_ns:14031, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("base_yyparse@/opt/soft_bak/pgpool-II-3.4.1/src/parser/gram.c:20532").return
avg_ns:10659, sum_ns:10659, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("core_yy_switch_to_buffer@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:2740").return
avg_ns:10165, sum_ns:10165, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("raw_parser@/opt/soft_bak/pgpool-II-3.4.1/src/parser/parser.c:49").return
avg_ns:9653, sum_ns:9653, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("insertSelectOptions@/opt/soft_bak/pgpool-II-3.4.1/src/parser/gram.y:13588").return
avg_ns:4284, sum_ns:8568, cnt:2, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("list_nth@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:415").return
avg_ns:6101, sum_ns:6101, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("process_integer_literal@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.l:1177").return
avg_ns:5768, sum_ns:5768, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("core_yylex_init@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:3279").return
avg_ns:4786, sum_ns:4786, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("core_yyensure_buffer_stack@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.c:2945").return
avg_ns:4711, sum_ns:4711, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("parser_init@/opt/soft_bak/pgpool-II-3.4.1/src/parser/gram.y:14022").return
avg_ns:4695, sum_ns:4695, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("scanner_finish@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scan.l:1108").return
avg_ns:4647, sum_ns:4647, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("makeIntConst@/opt/soft_bak/pgpool-II-3.4.1/src/parser/gram.y:13370").return
avg_ns:4585, sum_ns:4585, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("makeRangeVar@/opt/soft_bak/pgpool-II-3.4.1/src/parser/makefuncs.c:421").return
avg_ns:4565, sum_ns:4565, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("appendStringInfoChar@/opt/soft_bak/pgpool-II-3.4.1/src/parser/stringinfo.c:174").return
avg_ns:4511, sum_ns:4511, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("downcase_truncate_identifier@/opt/soft_bak/pgpool-II-3.4.1/src/parser/scansup.c:131").return
avg_ns:4430, sum_ns:4430, cnt:1, execname:pgpool, pp:process("/opt/pgpool3.4.1/bin/pgpool").function("list_concat@/opt/soft_bak/pgpool-II-3.4.1/src/parser/list.c:326").return
Pass 5: run completed in 10usr/14470sys/20108real ms.
The times sum to 698965 ns = 0.698965 ms. Put another way, pgpool-II's own overhead for handling this one query is about 0.698965 ms. (The true figure should be slightly smaller, since stap's own timestamping also has some cost; as an empirical value, call it roughly 0.5 ms.)
(This is close to the 0.765831 - 0.252501 = 0.51333 ms difference seen in the earlier tests.)
This is the time that select * from pg_class limit 1; spends inside the pgpool-II code.

#20 | Posted 2015-06-04 15:08
When I need to convert a longer string or bytea into a shorter string or numeric value, is md5 or hashtext the better choice?

When we store longer strings in PostgreSQL, and the data volume is large and exact matching is required, what are the good options?
The first thing that comes to mind is an index, but when a string is too long it exceeds the index length limit (see src/backend/access/common/indextuple.c; the limit is INDEX_SIZE_MASK) and the index cannot be created. Indexes on long strings are also fairly inefficient.
So is there another way?
We can convert the string to something shorter, a string or a number, e.g. with md5 or hashtext.
These conversions are not necessarily one-to-one, so after the lookup the original string must still be used for the final filter.
For example:
postgres=# create table t2 (info text);
postgres=# insert into t2 values (repeat(md5(random()::text),100000));
INSERT 0 1
postgres=# create index idx_test on t2(info);
ERROR:  54000: index row requires 36688 bytes, maximum size is 8191
LOCATION:  index_form_tuple, indextuple.c:174
postgres=# select hashtext(info),md5(info) from t2;
  hashtext   |               md5               
-------------+----------------------------------
-1986549156 | b62583576dcc9820316585a1714acc78
Note that md5 still returns a string, whereas hashtext returns an int.
postgres=# \df hashtext
                            List of functions
   Schema   |   Name   | Result data type | Argument data types |  Type  
------------+----------+------------------+---------------------+--------
pg_catalog | hashtext | integer          | text                | normal
(1 row)

postgres=# \df md5
                          List of functions
   Schema   | Name | Result data type | Argument data types |  Type  
------------+------+------------------+---------------------+--------
pg_catalog | md5  | text             | bytea               | normal
pg_catalog | md5  | text             | text                | normal
(2 rows)
So which one computes faster?
postgres=# \timing
Timing is on.
postgres=# select count(hashtext(i::text)) from generate_series(1,1000000) t(i);
  count  
---------
1000000
(1 row)
Time: 662.121 ms
postgres=# select count(md5(i::text)) from generate_series(1,1000000) t(i);
  count  
---------
1000000
(1 row)
Time: 986.254 ms
Clearly hashtext is somewhat faster to compute.
And which is more efficient as an index? I inserted 10 million rows into each table, then used pgbench to test lookup speed.
postgres=# create table t_md5(info text);
CREATE TABLE
Time: 4.594 ms
postgres=# create table t_hashtext(info int);
CREATE TABLE
Time: 2.874 ms
postgres=# insert into t_md5 select md5(i::text) from generate_series(1,10000000) t(i);
INSERT 0 10000000
Time: 34427.290 ms
postgres=# insert into t_hashtext select hashtext(i::text) from generate_series(1,10000000) t(i);
INSERT 0 10000000
Time: 22678.304 ms
Index creation speed comparison:
postgres=# create index idx_t_hashtext on t_hashtext(info);
CREATE INDEX
Time: 27838.217 ms
postgres=# create index idx_t_md5 on t_md5(info);
CREATE INDEX
Time: 58043.625 ms
Table and index size comparison:
postgres=# \dt+ t_hashtext
                       List of relations
Schema |    Name    | Type  |  Owner   |  Size  | Description
--------+------------+-------+----------+--------+-------------
public | t_hashtext | table | postgres | 344 MB |
(1 row)

postgres=# \dt+ t_md5
                    List of relations
Schema | Name  | Type  |  Owner   |  Size  | Description
--------+-------+-------+----------+--------+-------------
public | t_md5 | table | postgres | 650 MB |
(1 row)

postgres=# \di+ idx_t_hashtext
                               List of relations
Schema |      Name      | Type  |  Owner   |   Table    |  Size  | Description
--------+----------------+-------+----------+------------+--------+-------------
public | idx_t_hashtext | index | postgres | t_hashtext | 213 MB |
(1 row)

postgres=# \di+ idx_t_md5
                          List of relations
Schema |   Name    | Type  |  Owner   | Table |  Size  | Description
--------+-----------+-------+----------+-------+--------+-------------
public | idx_t_md5 | index | postgres | t_md5 | 553 MB |
(1 row)
Query speed comparison. First, confirm that the index is used:
postgres=# explain select * from t_hashtext where info=hashtext(1::text);
                                      QUERY PLAN                                      
--------------------------------------------------------------------------------------
Index Only Scan using idx_t_hashtext on t_hashtext  (cost=0.43..8.45 rows=1 width=4)
   Index Cond: (info = (-631133447))
(2 rows)
postgres=# explain select * from t_md5 where info=md5(1::text);
                                 QUERY PLAN                                 
-----------------------------------------------------------------------------
Index Only Scan using idx_t_md5 on t_md5  (cost=0.43..8.45 rows=1 width=33)
   Index Cond: (info = 'c4ca4238a0b923820dcc509a6f75849b'::text)
(2 rows)
Test with pgbench:
postgres@db-172-16-3-150-> vi test.sql
\setrandom id 1 10000000
select * from t_md5 where info=md5(:id::text);
postgres@db-172-16-3-150-> pgbench -M prepared -f ./test.sql -n -r -c 16 -j 8 -T 30
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 8
duration: 30 s
number of transactions actually processed: 1944319
latency average: 0.247 ms
tps = 64809.373871 (including connections establishing)
tps = 64833.546859 (excluding connections establishing)
statement latencies in milliseconds:
        0.002463        \setrandom id 1 10000000
        0.242614        select * from t_md5 where info=md5(:id::text);

postgres@db-172-16-3-150-> vi test.sql
\setrandom id 1 10000000
select * from t_hashtext where info=hashtext(:id::text);
postgres@db-172-16-3-150-> pgbench -M prepared -f ./test.sql -n -r -c 16 -j 8 -T 30
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 8
duration: 30 s
number of transactions actually processed: 2045818
latency average: 0.235 ms
tps = 68192.664950 (including connections establishing)
tps = 68217.163836 (excluding connections establishing)
statement latencies in milliseconds:
        0.002492        \setrandom id 1 10000000
        0.230370        select * from t_hashtext where info=hashtext(:id::text);
Query performance differs very little.
The host used for these tests has two Intel(R) Xeon(R) E5504 @ 2.00GHz CPUs, running CentOS 6.x x64.
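Tying this back to the original question, the complete exact-match lookup pattern looks like the sketch below (t_doc and info_hash are hypothetical names for illustration, not the test tables above):

create table t_doc (info text, info_hash int);
create index idx_t_doc_hash on t_doc (info_hash);
-- The hash equality uses the small integer index; because hashtext()
-- can collide, the equality on the original string makes the match exact.
select * from t_doc
 where info_hash = hashtext('some long string')
   and info = 'some long string';

The final string comparison is not optional: with 10 million distinct strings hashed to 32 bits, a back-of-the-envelope birthday estimate gives n(n-1)/2 / 2^32 ≈ (10^7)^2 / 2 / 4.3*10^9 ≈ 11,600 expected colliding pairs.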

[References]
1. src/backend/access/hash/hashfunc.c
Datum
hashtext(PG_FUNCTION_ARGS)
{
        text       *key = PG_GETARG_TEXT_PP(0);
        Datum           result;

        /*
         * Note: this is currently identical in behavior to hashvarlena, but keep
         * it as a separate function in case we someday want to do something
         * different in non-C locales.  (See also hashbpchar, if so.)
         */
        result = hash_any((unsigned char *) VARDATA_ANY(key),
                                          VARSIZE_ANY_EXHDR(key));

        /* Avoid leaking memory for toasted inputs */
        PG_FREE_IF_COPY(key, 0);

        return result;
}

...
/*
* hash_any() -- hash a variable-length key into a 32-bit value
*              k               : the key (the unaligned variable-length array of bytes)
*              len             : the length of the key, counting by bytes
*
* Returns a uint32 value.  Every bit of the key affects every bit of
* the return value.  Every 1-bit and 2-bit delta achieves avalanche.
* About 6*len+35 instructions. The best hash table sizes are powers
* of 2.  There is no need to do mod a prime (mod is sooo slow!).
* If you need less than 32 bits, use a bitmask.
*
* Note: we could easily change this function to return a 64-bit hash value
* by using the final values of both b and c.  b is perhaps a little less
* well mixed than c, however.
*/
Datum
hash_any(register const unsigned char *k, register int keylen)
{
......
}
2. src/backend/utils/adt/varlena.c
/*
* Create an md5 hash of a text string and return it as hex
*
* md5 produces a 16 byte (128 bit) hash; double it for hex
*/
#define MD5_HASH_LEN  32

Datum
md5_text(PG_FUNCTION_ARGS)
{
        text       *in_text = PG_GETARG_TEXT_PP(0);
        size_t          len;
        char            hexsum[MD5_HASH_LEN + 1];

        /* Calculate the length of the buffer using varlena metadata */
        len = VARSIZE_ANY_EXHDR(in_text);

        /* get the hash result */
        if (pg_md5_hash(VARDATA_ANY(in_text), len, hexsum) == false)
                ereport(ERROR,
                                (errcode(ERRCODE_OUT_OF_MEMORY),
                                 errmsg("out of memory")));

        /* convert to text and return it */
        PG_RETURN_TEXT_P(cstring_to_text(hexsum));
}
3. backend/libpq/md5.c
The md5 computation function.
4. src/backend/access/common/indextuple.c
        /*
         * Here we make sure that the size will fit in the field reserved for it
         * in t_info.
         */
        if ((size & INDEX_SIZE_MASK) != size)
                ereport(ERROR,
                                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                                 errmsg("index row requires %zu bytes, maximum size is %zu",
                                                size, (Size) INDEX_SIZE_MASK)));
5. src/include/access/itup.h
#define INDEX_SIZE_MASK 0x1FFF
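For reference, 0x1FFF = 8191 bytes, which matches the "maximum size is 8191" in the create index error near the top of this post.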

In practice, index page splits also need to be considered; more on that in a later post (see the sketch below for one more variant of the hashing approach).
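A variant worth noting (again a sketch with hypothetical names): an expression index avoids maintaining a separate hash column, at the cost of computing hashtext() on every write:

create table t_doc2 (info text);
create index idx_t_doc2_hash on t_doc2 ((hashtext(info)));
-- The planner can use this index when the query repeats the indexed
-- expression; the second condition again filters out hash collisions.
select * from t_doc2
 where hashtext(info) = hashtext('some long string')
   and info = 'some long string';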
