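As part of the Citus team (Citus scales out Postgres horizontally, but that is not all we work on), I have been working on pg_auto_failover for quite some time, and I am excited that we have now released pg_auto_failover as open source, to give you automated failover and high availability!

When designing pg_auto_failover, our goal was to provide an easy to set up Business Continuity solution for Postgres that tolerates the failure of any one node in the system. The documentation chapter about the pg_auto_failover architecture includes the following:

It is important to understand that pg_auto_failover is optimised for Business Continuity. In the event of losing a single node, pg_auto_failover is able to continue the PostgreSQL service and prevents any data loss while doing so, thanks to PostgreSQL synchronous replication.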

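Introduction to pg_auto_failover

The pg_auto_failover solution for Postgres is meant to provide an automated failover solution that is easy to set up and reliable. The solution includes software-driven decision making about when to implement a failover in production.

The most important part of any automated failover system is its decision-making policy, and we have a whole documentation chapter online about the pg_auto_failover fault tolerance mechanisms.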

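When using pg_auto_failover, multiple active agents are deployed to keep track of the properties of your production Postgres setup:

The monitor, a Postgres database that is itself equipped with the pg_auto_failover extension, registers the active Postgres nodes and checks their health.
Each Postgres node registered in the pg_auto_failover monitor must also run a local agent, the pg_autoctl run service.
Each managed Postgres service has two Postgres nodes set up together in the same group. A single monitor setup can manage as many Postgres groups as needed.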

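With such a deployment, the monitor connects to every registered node on a regular schedule (every 20 seconds by default) and records success or failure in its pgautofailover.node table.

In addition, the pg_autoctl run service on each Postgres node checks that Postgres is running and watches the pg_stat_replication statistics for the other node. This Postgres system view allows our local agent to discover the state of the network connectivity between the primary and the standby node. The local agent reports the state of its node to the monitor every 5 seconds, unless a transition is needed, in which case it reports immediately.

The pg_auto_failover monitor makes decisions based on the known state of both nodes in a group, and only follows a Finite State Machine (FSM) that we carefully designed to ensure that the nodes converge. In particular, the FSM only makes progress once the pg_autoctl agent reports that it has successfully implemented the decided transition to a new state. The architecture documentation section about the failover logic contains diagrams of the FSM that we use to make automated failover decisions in pg_auto_failover.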
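To make the convergence rule more concrete, here is a minimal Python sketch of a state machine that refuses to assign a new goal until the previous transition has been reported as done. The state names and transitions are only the ones visible in the log output of this walkthrough; the `NodeFSM` class and its methods are hypothetical and are not part of pg_auto_failover, whose real state machine has more states and rules.

```python
# Illustrative sketch only: NOT the pg_auto_failover implementation.
# Transitions are limited to those visible in this article's logs.
ALLOWED = {
    ("init", "single"),
    ("init", "wait_standby"),
    ("wait_standby", "catchingup"),
    ("catchingup", "secondary"),
}


class NodeFSM:
    """Tracks one node's reported state and the goal state assigned to it."""

    def __init__(self):
        self.reported = "init"
        self.goal = "init"

    def assign_goal(self, goal):
        # A new goal may only be assigned once the node has converged.
        if self.reported != self.goal:
            raise RuntimeError("previous transition not yet reported as done")
        if (self.reported, goal) not in ALLOWED:
            raise ValueError(f"no transition {self.reported!r} -> {goal!r}")
        self.goal = goal

    def report(self, state):
        # The agent reports the state it has actually reached.
        if state not in (self.reported, self.goal):
            raise ValueError(f"unexpected report {state!r}")
        self.reported = state

    @property
    def converged(self):
        return self.reported == self.goal


node_b = NodeFSM()
node_b.assign_goal("wait_standby")
print(node_b.converged)  # the agent has not reported back yet
node_b.report("wait_standby")
node_b.assign_goal("catchingup")
node_b.report("catchingup")
node_b.assign_goal("secondary")
node_b.report("secondary")
print(node_b.reported, node_b.converged)
```

The point of the design is that decisions are driven by reported facts rather than by assumptions: a goal that was never confirmed blocks any further transition for that node.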

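pg_auto_failover Quick Start

Once again, see the Quick Start section of the pg_auto_failover documentation for more details. When trying out the project for the first time, the easiest approach is to create a monitor, then register a primary Postgres instance, and then register a secondary Postgres instance.

Below are some shell commands that implement a simple deployment, all on localhost, for discovering the project.

The monitor

In a first terminal, terminal tab, screen, or tmux window, run the following commands to create a monitor. This includes initialising a Postgres cluster with initdb, installing our pg_auto_failover extension, and opening connection privileges in the HBA file.

First, we prepare the environment in the terminal: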

$ mkdir -p /tmp/pg_auto_failover/test

$ export PGDATA=/tmp/pg_auto_failover/test/monitor

Then we can create the monitor Postgres instance on localhost, on local port 6000, using the PGDATA environment setting we just prepared:

$ pg_autoctl create monitor --nodename localhost --pgport 6000

12:12:53 INFO Initialising a PostgreSQL cluster at "/tmp/pg_auto_failover/test/monitor"

12:12:53 INFO Now using absolute pgdata value "/private/tmp/pg_auto_failover/test/monitor" in the configuration

12:12:53 INFO /Applications/Postgres.app/Contents/Versions/10/bin/pg_ctl --pgdata /tmp/pg_auto_failover/test/monitor --options "-p 6000" --options "-h *" --wait start

12:12:53 INFO Granting connection privileges on 192.168.1.0/24

12:12:53 INFO Your pg_auto_failover monitor instance is now ready on port 6000.

12:12:53 INFO pg_auto_failover monitor is ready at postgres://autoctl_node@localhost:6000/pg_auto_failover

12:12:53 INFO Monitor has been succesfully initialized.

Now we can display the connection string to the monitor again:

$ pg_autoctl show uri

postgres://autoctl_node@localhost:6000/pg_auto_failover

The Postgres primary node

In another terminal (a tab, a window, whatever you usually do), now create a primary PostgreSQL instance:

$ export PGDATA=/tmp/pg_auto_failover/test/node_a

$ pg_autoctl create postgres --nodename localhost --pgport 6001 --dbname test --monitor postgres://autoctl_node@localhost:6000/pg_auto_failover

12:15:27 INFO Registered node localhost:6001 with id 1 in formation "default", group 0.

12:15:27 INFO Writing keeper init state file at "/Users/dim/.local/share/pg_autoctl/tmp/pg_auto_failover/test/node_a/pg_autoctl.init"

12:15:27 INFO Successfully registered as "single" to the monitor.

12:15:28 INFO Initialising a PostgreSQL cluster at "/tmp/pg_auto_failover/test/node_a"

12:15:28 INFO Now using absolute pgdata value "/private/tmp/pg_auto_failover/test/node_a" in the configuration

12:15:28 INFO Postgres is not running, starting postgres

12:15:28 INFO /Applications/Postgres.app/Contents/Versions/10/bin/pg_ctl --pgdata /private/tmp/pg_auto_failover/test/node_a --options "-p 6001" --options "-h *" --wait start

12:15:28 INFO CREATE DATABASE test;

12:15:29 INFO FSM transition from "init" to "single": Start as a single node

12:15:29 INFO Initialising postgres as a primary

12:15:29 INFO Transition complete: current state is now "single"

12:15:29 INFO Keeper has been succesfully initialized.

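This command registers the PostgreSQL instance with the monitor, creates the instance using pg_ctl initdb, prepares some connection privileges for the monitor health checks, and creates a database named test for you. It then implements the first transition ordered by the monitor, going from the state INIT to the state SINGLE.

We are still testing for now, so start the pg_autoctl run service interactively in the terminal. For a production setup, this would go into a system service that is started at boot time, such as a systemd unit.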

$ pg_autoctl run

12:17:07 INFO Managing PostgreSQL installation at "/tmp/pg_auto_failover/test/node_a"

12:17:07 INFO pg_autoctl service is starting

12:17:07 INFO Calling node_active for node default/1/0 with current state: single, PostgreSQL is running, sync_state is "", WAL delta is -1.

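The last line repeats every 5 seconds, which shows that the primary node is healthy and can connect to the monitor as expected. The node is now in the SINGLE state, which will change as soon as a new Postgres node joins the group.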
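If you want to watch these periodic reports from a script, the node_active line can be parsed. The following Python sketch is written against the exact line printed above; reading `default/1/0` as formation, node id, and group is an assumption based on the earlier registration message, and the log format may differ in other pg_auto_failover versions. The `parse_node_active` helper is hypothetical.

```python
import re

# Pattern for the node_active line as printed in this article; the real log
# format is not guaranteed to stay the same across versions.
NODE_ACTIVE = re.compile(
    r'node (?P<formation>[^/]+)/(?P<node_id>\d+)/(?P<group>\d+) '
    r'with current state: (?P<state>\w+), '
    r'PostgreSQL is (?P<pg>[^,]+), '
    r'sync_state is "(?P<sync_state>[^"]*)", '
    r'WAL delta is (?P<wal_delta>-?\d+)'
)


def parse_node_active(line):
    """Return the fields of a node_active log line, or None if no match."""
    m = NODE_ACTIVE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Convert the numeric fields so they can be compared and aggregated.
    for key in ("node_id", "group", "wal_delta"):
        fields[key] = int(fields[key])
    return fields


line = ('12:17:07 INFO Calling node_active for node default/1/0 with current '
        'state: single, PostgreSQL is running, sync_state is "", WAL delta is -1.')
print(parse_node_active(line))
```

A monitoring script could, for example, alert when the parsed state stops being the expected one or when no such line has been seen for several reporting intervals.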

The Postgres secondary node

Now is the time to create a secondary Postgres instance, in yet another terminal:

$ export PGDATA=/tmp/pg_auto_failover/test/node_b

$ pg_autoctl create postgres --nodename localhost --pgport 6002 --dbname test --monitor postgres://autoctl_node@localhost:6000/pg_auto_failover

12:21:08 INFO Registered node localhost:6002 with id 5 in formation "default", group 0.

12:21:09 INFO Writing keeper init state file at "/Users/dim/.local/share/pg_autoctl/tmp/pg_auto_failover/test/node_b/pg_autoctl.init"

12:21:09 INFO Successfully registered as "wait_standby" to the monitor.

12:21:09 INFO FSM transition from "init" to "wait_standby": Start following a primary

12:21:09 INFO Transition complete: current state is now "wait_standby"

12:21:14 INFO FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby

12:21:14 INFO The primary node returned by the monitor is localhost:6001

12:21:14 INFO Initialising PostgreSQL as a hot standby

12:21:14 INFO Running /Applications/Postgres.app/Contents/Versions/10/bin/pg_basebackup -w -h localhost -p 6001 --pgdata /tmp/pg_auto_failover/test/backup -U pgautofailover_replicator --write-recovery-conf --max-rate 100M --wal-method=stream --slot pgautofailover_standby ...

12:21:14 INFO pg_basebackup: initiating base backup, waiting for checkpoint to complete

pg_basebackup: checkpoint completed

pg_basebackup: write-ahead log start point: 0/2000028 on timeline 1

pg_basebackup: starting background WAL receiver

32041/32041 kB (100%), 1/1 tablespace

pg_basebackup: write-ahead log end point: 0/20000F8

pg_basebackup: waiting for background process to finish streaming ...

pg_basebackup: base backup completed

12:21:14 INFO Postgres is not running, starting postgres

12:21:14 INFO /Applications/Postgres.app/Contents/Versions/10/bin/pg_ctl --pgdata /tmp/pg_auto_failover/test/node_b --options "-p 6002" --options "-h *" --wait start

12:21:15 INFO PostgreSQL started on port 6002

12:21:15 WARN Contents of "/tmp/pg_auto_failover/test/node_b/postgresql-auto-failover.conf" have changed, overwriting

12:21:15 INFO Transition complete: current state is now "catchingup"

12:21:15 INFO Now using absolute pgdata value "/private/tmp/pg_auto_failover/test/node_b" in the configuration

12:21:15 INFO Keeper has been succesfully initialized.

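This time the registration with the monitor returned the state WAIT_STANDBY, which drives pg_autoctl to create a secondary node. That is because a server already exists in the group and is currently SINGLE. In parallel, the monitor assigns the goal state WAIT_PRIMARY to the primary node, where the local pg_autoctl agent retrieves the node name and port of the new node from the monitor database and opens pg_hba.conf for replication. When that is done, the secondary node continues with a pg_basebackup, installs a recovery.conf file, starts the local Postgres service, and informs the monitor that it has reached the goal state.

We are still CATCHING_UP, though. That means automated failover is not possible yet. To be able to orchestrate a failover, we need to run the local service on the new node, which monitors Postgres health and replication status and reports to the monitor every 5 seconds: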

$ pg_autoctl run

12:26:26 INFO Calling node_active for node default/5/0 with current state: catchingup, PostgreSQL is running, sync_state is "", WAL delta is -1.

12:26:26 INFO FSM transition from "catchingup" to "secondary": Convinced the monitor that I'm up and running, and eligible for promotion again

12:26:26 INFO Transition complete: current state is now "secondary"

12:26:26 INFO Calling node_active for node default/5/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.

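The new node is now in the SECONDARY state and keeps reporting to the monitor, ready to promote the local Postgres instance when the monitor makes that decision.

Automated and manual failover with pg_auto_failover

All that is needed to provision a PostgreSQL cluster with automated failover using pg_auto_failover is two commands per node: first pg_autoctl create ... to create the node, and then pg_autoctl run to run the local service that implements the transitions decided by the monitor.

To witness a failover, the easiest way is to stop the pg_autoctl run service (using ^C in the terminal where it runs, or pg_autoctl stop --pgdata ... from anywhere else), and then also stop the Postgres instance using pg_ctl -D ... stop.

When only Postgres is stopped, the pg_autoctl run service detects the situation as abnormal and first tries to restart Postgres. Only when Postgres fails to start 3 times in a row, with the default pg_auto_failover parameters, is a failover considered appropriate.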
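The restart-before-failover rule can be sketched in a few lines of Python. This is only an illustration of the counting logic described above, with the default of 3 consecutive attempts taken from the text; `should_fail_over` and `start_postgres` are hypothetical names, and the real pg_autoctl service may also involve timeouts that are not modelled here.

```python
def should_fail_over(start_postgres, max_attempts=3):
    """Return True once `max_attempts` consecutive restart attempts failed.

    `start_postgres` is a hypothetical callable returning True on success.
    """
    for _ in range(max_attempts):
        if start_postgres():
            return False  # Postgres is back, so no failover is needed
    return True


# Simulated restart outcomes standing in for real attempts to start Postgres.
outcomes = iter([False, False, True])
print(should_fail_over(lambda: next(outcomes)))  # third attempt succeeds
print(should_fail_over(lambda: False))           # Postgres never comes back
```

Trying a local restart first avoids a failover for transient problems, such as someone stopping Postgres by mistake, while still bounding how long the service stays unavailable.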

Another way to inject a failover condition is to politely ask the monitor to orchestrate one for you:

$ psql postgres://autoctl_node@localhost:6000/pg_auto_failover

> select pgautofailover.perform_failover();

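Connection strings for applications and clients

The whole setup runs as what pg_auto_failover calls a formation. The default formation is named default and contains a single group of two Postgres instances. The idea is to have a single entry point for connecting your application to any given formation. To get the connection string to our Postgres service managed by pg_auto_failover, issue the following command, for instance in the monitor terminal: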

$ pg_autoctl show uri --formation default

postgres://localhost:6002,localhost:6001/test?target_session_attrs=read-write

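We are using the multi-host capability of libpq here. It works with any modern Postgres driver that is based on libpq (most of them are), and other native drivers, such as the JDBC Postgres driver, are known to implement the same facility.

Of course, it works with psql: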

$ psql postgres://localhost:6002,localhost:6001/test?target_session_attrs=read-write

psql (12devel, server 10.7)

Type "help" for help.

test# select pg_is_in_recovery();

pg_is_in_recovery

═══════════════════

f

(1 row)

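When using such a connection string, the connection driver connects to the first host and checks whether it accepts writes. If it does not, the driver connects to the second host and checks again. That is because we said we want target_session_attrs to be read-write.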
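The host selection described above can be simulated without a running server. The Python sketch below parses the multi-host URI from this article and walks the host list in order; `parse_multihost_uri`, `pick_host`, and the `accepts_writes` probe are hypothetical helpers written for illustration, not the libpq implementation, and the parser is deliberately simple (it does not handle IPv6 literals, for instance).

```python
from urllib.parse import urlsplit, parse_qs


def parse_multihost_uri(uri):
    """Split a libpq-style multi-host URI into (hosts, dbname, params)."""
    parts = urlsplit(uri)
    # Drop optional "user@" credentials, then split the comma-separated hosts.
    hostspec = parts.netloc.rpartition("@")[2]
    hosts = []
    for item in hostspec.split(","):
        host, _, port = item.partition(":")
        hosts.append((host, int(port) if port else 5432))
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    return hosts, parts.path.lstrip("/"), params


def pick_host(hosts, params, accepts_writes):
    """Return the first host satisfying target_session_attrs, else None."""
    want_rw = params.get("target_session_attrs") == "read-write"
    for host in hosts:
        if not want_rw or accepts_writes(host):
            return host
    return None


uri = "postgres://localhost:6002,localhost:6001/test?target_session_attrs=read-write"
hosts, dbname, params = parse_multihost_uri(uri)

# Stub probe: pretend only the node on port 6001 is currently the primary.
primary = ("localhost", 6001)
print(pick_host(hosts, params, lambda h: h == primary))
```

After a failover, the same connection string keeps working unchanged: only the answer of the write probe changes, so the driver ends up on the other host.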

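With this feature of core Postgres, we implement client-side High Availability: in case of a failover, our node_b becomes the primary, the application then needs to target node_b for writes, and that is done automatically at the connection driver level.

High Availability, Fault Tolerance, and Business Continuity

So pg_auto_failover is all about Business Continuity, and uses a single standby server for each primary Postgres server.

In a classic HA setup for Postgres, we rely on synchronous replication with two standby servers for each primary server. This is the expected architecture when you want to achieve zero or near-zero recovery time objective (RTO) and recovery point objective (RPO).

The idea behind using two standby nodes per primary is that you can lose any one standby server and still happily accept writes, knowing that the data remains available in two different places. This is a very good property to have in many production setups, and it is the target of other existing Postgres HA tools.

In some cases, the best trade-off for a production setup is a little different from what current Postgres HA tooling supports. Sometimes it is acceptable to face a service interruption while a disaster recovery procedure is executed, because the assessment of the risks taken in that situation is compatible with the production budget, the expected SLA, or a combination of both.

Not every project needs more than 99.95% availability, let alone the extra mile that is sometimes needed to reach a 99.999% target. Also, while IoT and some other use cases, such as very large user bases, require HA solutions that scale from terabytes to petabytes of data, many projects are meant for smaller audiences and data sets. When you have gigabytes of data, or even tens of gigabytes, the time needed for disaster recovery is no longer a showstopper, depending on the terms of your SLA.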
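To put the two availability targets in perspective, the short Python calculation below converts a yearly availability percentage into a downtime budget. It assumes a 365-day year and counts all downtime equally; the function name is only illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year


def allowed_downtime_minutes(availability_percent):
    """Yearly downtime budget, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


for target in (99.95, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes per year")
```

A 99.95% target leaves a budget of roughly four and a half hours per year, while 99.999% leaves only about five minutes, which is why the latter usually rules out any manual disaster recovery procedure.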

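Data availability

pg_auto_failover uses PostgreSQL synchronous replication to guarantee that no data is lost at the time of a failover operation. The synchronous replication feature of Postgres guarantees that when the client application receives the COMMIT message back from Postgres, the data has already made it to our secondary node.

pg_auto_failover works correctly when facing the loss of any ONE node in the system. If you lose the primary server and then also lose the secondary server, nothing is left apart from your backups. When using pg_auto_failover, you still have to set up a proper disaster recovery solution for the cases where you lose more than one server at a time. And yes, this happens.

Also watch out for the infamous "file system is full", which likes to strike the primary and the secondary server at the same time, given that we tend to deploy servers with similar specifications...

Conclusion

The entire Citus team here at Microsoft is excited about the open source release of the pg_auto_failover extension. We released pg_auto_failover under the Postgres open source license, so that you can enjoy our contribution in exactly the same capacity as when deploying Postgres. The project is fully open, and everyone is welcome to participate and contribute on our GitHub repository at https://github.com/citusdata/pg_auto_failover. We follow the Microsoft Open Source Code of Conduct and make sure that everyone is welcome and listened to.

My hope is that, thanks to pg_auto_failover, many of you will now be able to deploy Postgres in production with an automated failover solution.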

Original article: https://www.citusdata.com/blog/2019/05/30/introducing-pg-auto-failover/

This article: http://jiagoushi.pro/node/922
