案例说明:
KingbaseES V8R6集群物理备份支持single-pro方式,本案例在集群执行single-pro方式备份并多次切换集群后,对集群执行了恢复测试,文档记录了恢复的详细过程。
适用版本:
KingbaseES V8R6
集群架构:
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------1 | node1 | primary | * running | | running | 44482 | no | n/a2 | node2 | standby | running | node1 | running | 15369 | no | 1 second(s) ago
一、查看集群备份
在执行single-prod模式的物理备份的初始化(sys_backup.sh init)后,同时会在主备库节点执行备份:
1、主库备份
[kingbase@node201 bin]$ /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/sys_rman --config=/home/kingbase/kbbr_repo/sys_rman.conf --stanza=kingbase info
WARN: set process-max 4 is too large, auto set to CPU core count 1
stanza: kingbasestatus: okcipher: nonedb (current)wal archive min/max (V008R006C008B0014): 000000490000000300000040/00000049000000030000004Afull backup: 20241101-105946Ftimestamp start/stop: 2024-11-01 10:59:46 / 2024-11-01 10:59:53wal start/stop: 000000490000000300000048 / 000000490000000300000048database size: 378.4MB, database backup size: 378.4MBrepo1: backup set size: 378.4MB, backup size: 378.4MBincr backup: 20241101-105946F_20241101-110356Itimestamp start/stop: 2024-11-01 11:03:56 / 2024-11-01 11:03:58wal start/stop: 00000049000000030000004A / 00000049000000030000004Adatabase size: 378.8MB, database backup size: 25.1MBrepo1: backup set size: 378.8MB, backup size: 25.1MBbackup reference list: 20241101-105946F
2、备库备份
[kingbase@node202 bin]$ /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/sys_rman --config=/home/kingbase/kbbr_repo/sys_rman.conf --stanza=kingbase info
WARN: set process-max 4 is too large, auto set to CPU core count 1
stanza: kingbasestatus: okcipher: nonedb (current)wal archive min/max (V008R006C008B0014): 000000490000000300000040/0000004B000000030000004Dfull backup: 20241101-105544Ftimestamp start/stop: 2024-11-01 10:55:44 / 2024-11-01 10:55:53wal start/stop: 000000490000000300000046 / 000000490000000300000046database size: 378.4MB, database backup size: 378.4MBrepo2: backup set size: 378.4MB, backup size: 378.4MB
二、切换集群测试
1、执行failover切换后
[kingbase@node201 bin]$ ./repmgr cluster showID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------1 | node1 | standby | running | node2 | default | 100 | 73 | 0 bytes | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=90002 | node2 | primary | * running | | default | 100 | 74 | | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
2、执行switchover切换后
[kingbase@node201 bin]$ ./repmgr cluster showID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------1 | node1 | primary | * running | | default | 100 | 75 | | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=90002 | node2 | standby | running | node1 | default | 100 | 74 | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
3、主库timeline时间线变化
如下所示,主库节点timeline变化:
三、主库数据恢复测试
1、模拟数据丢失
prod=# drop table t3;
DROP TABLE
prod=# drop table t2;
DROP TABLE
prod=# drop table t1;
DROP TABLE
prod=# \dList of relationsSchema | Name | Type | Owner
--------+-------------------------+-------------------+--------public | sys_roles | table | systempublic | sys_stat_statements | view | systempublic | sys_stat_statements_all | view | systempublic | tb1 | table | systempublic | teachers | table | system
(15 rows)
2、通过物理备份恢复数据
1)关闭集群
[kingbase@node201 bin]$ ./sys_monitor.sh stop
2)备份原data
[kingbase@node201 kingbase]$ mv data data.bk
3)执行全量恢复
如下所示,主库节点数据库恢复成功:
[kingbase@node201 bin]$ /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/sys_rman --config=/home/kingbase/kbbr_repo/sys_rman.conf --stanza=kingbase restore
2024-11-01 11:14:17.027 P00 INFO: restore command begin 2.27: --band-width=0 --config=/home/kingbase/kbbr_repo/sys_rman.conf --exec-id=56849-a18b87e7 --kb2-host=192.168.1.202 --kb1-path=/home/kingbase/cluster/R6C8/HAC8/kingbase/data --kb2-path=/home/kingbase/cluster/R6C8/HAC8/kingbase/data --link-all --log-level-console=info --log-level-file=info --log-path=/home/kingbase/cluster/R6C8/HAC8/kingbase/log --log-subprocess --non-archived-space=1024 --process-max=4 --repo1-path=/home/kingbase/kbbr_repo --stanza=kingbase
WARN: set process-max 4 is too large, auto set to CPU core count 1
2024-11-01 11:14:17.065 P00 INFO: repo1: restore backup set 20241101-105946F_20241101-110356I, recovery will start at 2024-11-01 11:03:56
2024-11-01 11:14:17.981 P00 INFO: Restore Process: FILE: 1 / 4572 0% SZIE: 182419456 bytes / 397204747 bytes 174.0MB / 378.8MB 45%
........2024-11-01 11:14:23.434 P00 INFO: Restore Process: FILE: 4572 / 4572 100% SZIE: 397204747 bytes / 397204747 bytes 378.8MB / 378.8MB 100%
2024-11-01 11:14:23.435 P00 INFO: write updated /home/kingbase/cluster/R6C8/HAC8/kingbase/data/kingbase.auto.conf
2024-11-01 11:14:23.438 P00 INFO: restore global/sys_control (performed last to ensure aborted restores cannot be started)
2024-11-01 11:14:23.439 P00 INFO: restore size = 378.8MB, file total = 4572
2024-11-01 11:14:23.440 P00 INFO: restore command end: completed successfully (6417ms)
4)查看数据恢复状态
如下所示,主库节点数据库恢复完成:
# 启动主库数据库服务
[kingbase@node201 bin]$ ./sys_ctl start -D ../data# 访问数据库
[kingbase@node201 bin]$ ./ksql -U system test
Type "help" for help.test=# \c prod
You are now connected to database "prod" as userName "system".
prod=# \dList of relationsSchema | Name | Type | Owner
--------+-------------------------+-------------------+--------public | sys_roles | table | systempublic | sys_stat_statements | view | systempublic | sys_stat_statements_all | view | systempublic | t1 | table | systempublic | t2 | table | systempublic | t3 | table | systempublic | tb1 | table | systempublic | teachers | table | system
(18 rows)prod=# select count(*) from t3;count(*)
----------10000
(1 row)
四、执行备库恢复
如下所示,在执行备库clone前,将主库kingbase.auto.conf中的“restore_command”选项注释:
[kingbase@node201 bin]$ cat ../data/kingbase.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
.......# Recovery settings generated by sys_rman restore on 2024-11-01 11:14:23
# restore_command = '/home/kingbase/cluster/R6C8/HAC8/kingbase/bin/sys_rman --config=/home/kingbase/kbbr_repo/sys_rman.conf --stanza=kingbase archive-get %f "%p"'
1、备库clone
[kingbase@node202 bin]$ ./repmgr standby clone -h 192.168.1.201 -U esrep -d esrep
2、启动备库数据库服务
[kingbase@node202 bin]$ ./sys_ctl start -D ../data
3、注册备库
[kingbase@node202 bin]$ ./repmgr standby register --force
4、查看集群状态
[kingbase@node201 bin]$ ./repmgr cluster showID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------1 | node1 | primary | * running | | default | 100 | 74 | | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=90002 | node2 | standby | running | node1 | default | 100 | 74 | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
You have mail in /var/spool/mail/kingbase
---如上所示,集群恢复完成 !
五、恢复故障案例
1、备库clone后故障
如下所示,在主库kingbase.auto.conf中没有注释“restore_command”参数后,备库执行clone,然后启动数据库服务,备库从归档日志开始恢复数据,导致主备库数据时间线timeline不一致,主备流复制建立失败:
备库sys_log日志:
[kingbase@node202 sys_log]$ tail -1000 kingbase-2024-11-01_113351.csv
2024-11-01 11:33:51.683 CST,,,32336,,67244c1f.7e50,1,,2024-11-01 11:33:51 CST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
2024-11-01 11:33:51.688 CST,,,32338,,67244c1f.7e52,1,,2024-11-01 11:33:51 CST,,0,LOG,00000,"database system was interrupted; last known up at 2024-11-01 11:33:21 CST",,,,,,,,,""
2024-11-01 11:33:52.247 CST,,,32338,,67244c1f.7e52,2,,2024-11-01 11:33:51 CST,,0,LOG,00000,"restored log file ""0000004B.history"" from archive",,,,,,,,,""
2024-11-01 11:33:52.259 CST,,,32338,,67244c1f.7e52,3,,2024-11-01 11:33:51 CST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2024-11-01 11:33:52.269 CST,,,32338,,67244c1f.7e52,4,,2024-11-01 11:33:51 CST,,0,LOG,00000,"restored log file ""0000004B.history"" from archive",,,,,,,,,""
2024-11-01 11:33:52.289 CST,,,32338,,67244c1f.7e52,5,,2024-11-01 11:33:51 CST,,0,FATAL,XX000,"requested timeline 75 is not a child of this server's history","Latest checkpoint is at 3/51000058 on timeline 74, but in the history of the requested timeline, the server forked off from that timeline at 3/4D0000A0.",,,,,,,,""
2024-11-01 11:33:52.290 CST,,,32336,,67244c1f.7e50,2,,2024-11-01 11:33:51 CST,,0,LOG,00000,"startup process (PID 32338) exited with exit code 1",,,,,,,,,""
2024-11-01 11:33:52.290 CST,,,32336,,67244c1f.7e50,3,,2024-11-01 11:33:51 CST,,0,LOG,00000,"aborting startup due to startup process failure",,,,,,,,,""
2024-11-01 11:33:52.297 CST,,,32336,,67244c1f.7e50,4,,2024-11-01 11:33:51 CST,,0,LOG,00000,"database system is shut down",,,,,,,,,""
2、注释主库restore_command参数
3、备库执行clone
如下所示,备库执行clone后,集群恢复正常:
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 74 | | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
2 | node2 | standby | running | node1 | default | 100 | 74 | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
六、总结
本文详细记录了,在集群环境下通过single-pro模式执行物理备份后的,集群恢复过程,可以用于数据库数据恢复的参考。