Installing Patroni and Stolon and practicing failures. Maxim Milyutin



Patroni and Stolon are two of the best-known and most advanced solutions for orchestrating PostgreSQL and building high-availability Leader-Followers clusters with automatic failover. However, engineers migrating from older, proven solutions (Corosync and Pacemaker), or from solutions built into other DBMSs, struggle to install these tools and often lack an understanding of what each component does. This master class walks through the typical installation of Patroni and Stolon clusters on virtual machines (not in containers) and shows how these clusters behave under various infrastructure failures. The whole process is demonstrated on three virtual machines run by Vagrant from prebuilt images. Attendees who prepare their environment in advance can follow along.



PGConf.Russia



! . Ozon . . Postgres Pro Patroni Stolon. .





-. , Stolon, Patroni . .



, Ansible , Postgres Pro , .



Patroni , , — https://github.com/vitabaks/postgresql_cluster. .





, .



  • PostgreSQL – shared-nothing, .
  • . , .
  • hot standby, . . .
  • :
  • pg_basebackup , , .
  • . standby .
  • pg_rewind, standby.
  • , .




https://eng.uber.com/mysql-migration/



https://github.com/sorintlab/stolon/issues/519



https://github.com/zalando/patroni/issues/538



  • 10- PostgreSQL . , , , . , , , . Write amplification, - , , WAL full page images, checkpoint. hint bits . . WAL. « PostgreSQL MySQL» .



  • .



  • , , DDL, sequence, , , . WAL. WAL -. GTID MySQL, CSN MS SQL Server.



  • pg_rewind.



  • Stolon Patroni , , , rolling upgrade Postgres .







, ? , . . - , health checks - .





, , – promote . .





, , , .





, ? , promote . .





, split brain . - , .



, , , .



, . .





? Postgres , . , , , , .





? , , , - .



– . , read only. .





fail. , . , .





https://github.com/citusdata/pg_auto_failover



https://github.com/citusdata/pg_auto_failover/issues/12#issuecomment-490551255



. . pg_auto_failover Citus Data.



. , . pg_stat_replication.





, . . , , . primary ( ) , .



, , . , , .



fail. , .





, , .





, . .



, . , , .



.





, . DCS (Distributed Configuration System – ). IP , .



DCS – Consul, Etcd, Raft Zookeeper, Zab. Zab – Paxos.



, DCS.



Patroni/ Stolon.



Postgres Postgres .





, Patroni/ Stolon.



  • -, autofailover. - .
  • . PostgreSQL.
  • , Kubernetes.
  • DBaaS (database as a service).
  • – . , - . , - .


(DCS) Etcd





https://raft.github.io/



. DCS. . , «» . DCS, , .



? . , Postgres, , DCS , , split , split brain. , fail DCS .



, DCS 3-5-7 , , 3- . ? . net split, , DCS.



Etcd RAFT . .





DCS , follower PostgreSQL. RAFT.



. . .



, . follower, . . - RTT fsync.



, follower, . , , . . .



, - .



14 42 .



vagrant status
Current machine states:

node1                     running (virtualbox)
node2                     running (virtualbox)
node3                     running (virtualbox)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.


. vagrant.





: , . , , . . .





. . , .





Etcd . Etcd , Etcd.





config Etcd. , Etcd, , IP , . . ETCD_LISTEN_CLIENT_URLS . ETCD_LISTEN_PEER_URLS .



ETCD_ADVERTISE_CLIENT_URLS ETCD_INITIAL_ADVERTISE_PEER_URLS. . discovery, .



: ETCD_HEARTBEAT_INTERVAL ETCD_ELECTION_TIMEOUT.
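As a rough illustration of how the variables named above fit together, here is a minimal sketch that prints an etcd environment file for one member of a three-node cluster. The node names follow the Vagrant machines (node1 to node3); the IP addresses, the data directory, and the plain-HTTP URLs are assumptions made for the example, not values taken from the talk. Ports 2379 (clients) and 2380 (peers) and the 100 ms / 1000 ms timing values are the etcd defaults.

```shell
#!/usr/bin/env bash
# Sketch only: the IP addresses and the data directory below are hypothetical.

render_etcd_env() {
  local name="$1" ip="$2"
  # Assumed static member list; adjust to the real addresses of the three VMs.
  local cluster="node1=http://172.20.20.11:2380,node2=http://172.20.20.12:2380,node3=http://172.20.20.13:2380"
  cat <<EOF
ETCD_NAME=${name}
ETCD_DATA_DIR=/var/lib/etcd
ETCD_LISTEN_CLIENT_URLS=http://${ip}:2379,http://127.0.0.1:2379
ETCD_LISTEN_PEER_URLS=http://${ip}:2380
ETCD_ADVERTISE_CLIENT_URLS=http://${ip}:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://${ip}:2380
ETCD_INITIAL_CLUSTER=${cluster}
ETCD_INITIAL_CLUSTER_STATE=new
ETCD_HEARTBEAT_INTERVAL=100
ETCD_ELECTION_TIMEOUT=1000
EOF
}
```

Calling `render_etcd_env node2 172.20.20.12` prints the file for the second member; the listen URLs are what the process binds to, while the advertise URLs are what it tells clients and peers to use.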





. . . Ansible. . , .





. Etcd .





, term 2. Term – timeline PostgreSQL. term .





etcdctl member list. , () , followers.



sudo pkill -STOP etcd


. , fail , . Etcd , . . .





. . , term.





, , .





«etcdctl cluster-health». , . .





Etcd. , . term follower’.





- . . ? – . Etcd . «comcast». API tables Etcd. , .



? «comcast --device eth1 --packet-loss 100%».





. , . time line. , -, . , term 4.





. , heartbeat_interval election_timeout. , followers , heartbeat , followers , . follower heartbeat - - -, . .



, , - . , . heartbeat_interval – 100 . , -, . election_timeout – .





. . , , RTT , election_timeout. Election_timeout . Ansible. .



comcast --device eth1 --stop



: comcast --device eth1 --latency 600. .





latency 600 . 600 – . RTT 200 .





ping . RTT 1 .





. , term . . , - , term. .





, heartbeat_interval election_timeout. , heartbeat , election_timeout 10 . Ansible. . Etcd-config. , . , . . , -. Etcd .
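To make the relationship between network RTT and the two timing parameters concrete, here is a small sketch based on the rule of thumb in the etcd tuning guide: a heartbeat interval of roughly the RTT between members and an election timeout of at least ten times the RTT, with an upper limit of 50000 ms. The function name is hypothetical, and the output is only a starting point for tuning, not the values used in the demo.

```shell
#!/usr/bin/env bash
# Sketch: derive starting values for etcd timing from a measured RTT in milliseconds.

recommend_etcd_timeouts() {
  local rtt_ms="$1"
  local heartbeat="$rtt_ms"
  local election=$(( rtt_ms * 10 ))
  # etcd does not accept election timeouts above 50000 ms.
  if [ "$election" -gt 50000 ]; then
    election=50000
  fi
  echo "ETCD_HEARTBEAT_INTERVAL=${heartbeat}"
  echo "ETCD_ELECTION_TIMEOUT=${election}"
}
```

For example, `recommend_etcd_timeouts 200` suggests a 200 ms heartbeat and a 2000 ms election timeout. Larger values make the cluster more tolerant of slow links at the cost of slower detection of a failed leader.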





. . follower’.





member list, , , followers .



, , , , - 10 .





- Etcd, . bar. Deadline exceeded – , , . Etcd. timeout . 5 . total_timeout , 10 .





«get», . -. .





. , .





. Election_timeout , heartbeat 100 .



, RAFT - . , : , , . .





. Etcdctl member list. . – follower.





. bash – comcast --device, . . . - sleep . comcast --device eth1 --stop sleep 1.5. done . , , . .
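The flapping-network loop mentioned here can be sketched roughly as follows. Only the `comcast --device eth1 --stop` call and the 1.5-second pause come from the talk; the use of `--packet-loss 100%` for the "down" phase, the length of that phase, and the function and variable names are assumptions. By default the function only prints the commands (dry run); set `RUN=` (empty) to execute them.

```shell
#!/usr/bin/env bash
# Sketch: alternate between a fully lossy link and a healthy one.
# RUN defaults to "echo", so nothing is changed unless RUN is set to empty.

flap_network() {
  local device="$1" cycles="$2" down_s="$3" up_s="$4"
  local run="${RUN-echo}"
  local i=0
  while [ "$i" -lt "$cycles" ]; do
    $run comcast --device "$device" --packet-loss 100%   # link down (assumed flag usage)
    $run sleep "$down_s"
    $run comcast --device "$device" --stop               # restore the link
    $run sleep "$up_s"
    i=$(( i + 1 ))
  done
}
```

For instance, `flap_network eth1 20 1 1.5` prints twenty down/up cycles; `RUN= flap_network eth1 20 1 1.5` would actually run them on the node.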





Etcd. , term , , - , term, . Term . . . .





, , Etcd, . 1 . , . . . , , Etcd fsync . , .



. comcast --device eth1 --stop.





https://github.com/etcd-io/etcd/blob/master/Documentation/tuning.md#time-parameters

https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md#example-hardware-configurations



. Etcd , .



, , .



, Etcd , , , , .



Patroni Stolon. , .





. netsplit, , DCS. , , Postgres , DCS . , Postgres, .



DCS. , . . . , DCS Patroni Stolon.





Stolon.





. DCS stolon-sentinel, . DCS : election, , statefull .



Postgres’ Stolon-keeper. – stolon-proxy, .





https://github.com/sorintlab/stolon/issues/313



3- , . , , 2- sentinel, . , 2- stolon-proxy. , , 2- Stolon-keeper, postgres-.



41 20



, . . , . , stolon’ . -, . Etcd . . . . – superuser, . . , .



Stolon. stolon.d/test-cluster.conf. , «test.cluster» . , , . Postgres, -. ,



- . , . Superuser, Stolon-keeper . . . , .



«test.cluster»? system/system/Stolon-keeper@.service. template-, . , - . ? Stolon, … . , , - , -, .



Ansible. . . , . . . Stolon-keeper. Name=Stolon-keeper@test-cluster state=started enable=on. .



. Test-cluster. , . lock - . , : Stolon-keeper, sentinel proxy . .



sentinel. . , , DCS. . . . sentinel , sentinel . . State=started enabled=on. - . , . , test-cluster. . , - . .



.





https://postgrespro.ru/docs/postgrespro/12/server-shutdown

https://github.com/sorintlab/stolon/issues/707



workflow Stolon:



  • «stolonctl init».
  • PostgreSQL pg_hba update.
  • , PostgreSQL , , , . . , Keeper, post-master. Stolon-keeper PostgreSQL.
  • «automaticPgRestart», postgres- .
  • , . , max_connections, max_locks_per_transaction postgres-. . , , «max_connections» «max_locks_per_transaction». , , , . .
  • – Stolon-keeper. – Stolon-keeper. . , .


, pg_hba. , hba. /opt/stolon/test-cluster. . . Stolon-test-cluster-spec.json. , . . , .



.





https://github.com/sorintlab/stolon/blob/master/doc/initialization.md

https://github.com/sorintlab/stolon/blob/master/doc/standbycluster.md



Stolon :



  • – .
  • – PITR, . standby cluster.
  • – existing. , DCS. DCS , , . , «existing».


. initdb, checksums, , pg_rewind . . stolonctl. . .



Keeper, . . Keeper , sentinel, . , initdb, . standby.



. «status». , Keepers, health checks Keepers Postgres, . , , . sentinel.



, . wantedgeneration currentgeneration. Stolon-keeper . sentinel , , , . Keeper . .



. json, . . . Keepers . , , . , . . . , . Etcd .



. : Etcd . , Etcd. , . , , Consul. Consul , . , , , , Stolon-keeper . Postgres, Stolon . , Stolon-keeper. systemd, on abort, kill -9 .



Postgres. kill -9 , . . – . . Stolon-keeper, Ok.



. . - . Postgres . Stolon-keeper . Postgres. .



. fail. Postgres-. , . pgbench.



- , Postgres, ? select , , select.



, checksums, , checksums , . Postgres , . , , checksums , - . Postgres. Patroni/Stolon .



pgbench. . , . 25432. . . Stolon/test-cluster/postgres/pg_hba.conf.



, Stolon superuser, , . , .



. «default», . «pg_hba». «update». json- pgHBA . local all postgres. postgres trust. – host all postgres 172.20.20.0/24 trust.
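A minimal sketch of how such a pgHBA patch could be assembled: the helper below (a hypothetical name) turns pg_hba.conf lines into the JSON document that `stolonctl update --patch` expects. The `host all postgres 172.20.20.0/24 trust` entry is taken from the talk; the `local` entry is a best reading of the garbled transcript. The helper assumes the entries contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Sketch: build a {"pgHBA": [...]} patch from pg_hba.conf lines passed as arguments.

build_pghba_patch() {
  local sep="" entry
  printf '{"pgHBA": ['
  for entry in "$@"; do
    printf '%s"%s"' "$sep" "$entry"
    sep=", "
  done
  printf ']}\n'
}
```

The patch could then be applied with something like `stolonctl --cluster-name test-cluster --store-backend etcdv3 update --patch "$(build_pghba_patch 'local all postgres trust' 'host all postgres 172.20.20.0/24 trust')"`, where the store backend is an assumption about this setup.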



, . . , Postgres. . Create user postgres superuser. , Postgres . pgbench . HBA user test. Patroni. .



while. 20 , . , . .





Stolon . :



  • sleepInterval – .
  • requestTimeout – deadline PostgreSQL. Deadline DCS – 5 .
  • failInterval – , sentinel , . Sentinel failInterval, , . , , , . . - , . . failInterval .




autofailover Stolon?



1 – fail . Stolon-keeper Postgres . sentinel. , sentinel. . sleepInterval. 10 .





2 – - , , sentinel. , Keeper .





3 – sentinel. Keepers. sleepInterval.



: (λ1 + λ2) * sleepInterval. . .





4 – . DCS. sentinel , .





, , DCS sentinel , failover 25 50 .



fail sentinel’, failover sentinel. sentinel. failover .





, Stolon-proxy Keeper , Keeper read only . Postgres. Postgres Stolon-proxy.



. DCS, , , , .





  • Stolon. Stolon . , DCS . , «deadKeeperRemovalInterval». 48 . , DCS. , . , , WAL. 48 , .



  • , Stolon . . , -, deadlines - Postgres. , dbWaitReadyTimeout deadline . – 60 . checkpoints, deadline .



  • syncTimeout – deadline . 30 . , . .



  • initTimeout – deadline , initdb .



  • -. convergenceTimeout. , Keeper . -. Stolon . - -, Stolon .







Patroni.





Patroni, , , . ? Stolon. Patroni . DCS, , Patroni.





Patroni, . . , DCS time to live . , , . . , - . s… . Patroni , WAL-, REST API, , . WAL . Proxy – .





. . . 3- Etcd. Postgres Pro HAProxy confd, Etcd .



2- Patroni. Patroni Postgres.





https://patroni.readthedocs.io/en/latest/existing_data.html



Patroni , . basebackup’ . Patroni , , .



basebackup. , , tablespace.





https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-ADMIN



workflow Patroni. , bootstrap. , , . . Stolon, , , . bootstrap. .



Patroni? postgresql.conf pg_hba.conf, recovery.conf DCS , Stolon. . .



Patroni postgres-. , , .



, – , Patroni.





https://github.com/zalando/patroni/blob/master/haproxy.cfg

https://github.com/zalando/patroni/tree/master/extras/confd



.



– . . . Patroni- REST API endpoints, , , .



HAProxy, healthchecks Patroni.



Patroni callbacks. , . .



HAProxy , DCS HAProxy. HAProxy + confd. consul-template. . .



10- Postgres libpq , , «target_session_attrs», . ? – , target_session_attrs.
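For illustration, a multi-host connection string with `target_session_attrs` might be built like this. libpq (PostgreSQL 10 and later) tries the listed hosts in order and, with `read-write`, keeps only a connection that accepts writes. The helper name, host names, and ports are assumptions for the example.

```shell
#!/usr/bin/env bash
# Sketch: build a multi-host libpq URI that only accepts a read-write server.

build_rw_uri() {
  local dbname="$1"; shift
  local hosts="" h
  for h in "$@"; do
    hosts="${hosts:+${hosts},}${h}"   # join host:port pairs with commas
  done
  echo "postgresql://${hosts}/${dbname}?target_session_attrs=read-write"
}
```

A client could then connect with `psql "$(build_rw_uri postgres node1:5432 node2:5432 node3:5432)"`, without any proxy between the application and the cluster.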



, , watchdog Postgres, , , Patroni-. ? Postgres , . .



Stolon Patroni , . - , .





https://www.consul.io/docs/guides/forwarding.html

https://learn.hashicorp.com/consul/day-2-operations/advanced-operations/dns-caching

https://pgconf.ru/2019/242817 https://pgconf.ru/2019/242821

https://github.com/cybertec-postgresql/vip-manager



, DNS. Consul . DNS . .



IP-. HAProxy + keepalived. vip-manager, DCS, IP- , . , Postgres Pro , , IP-. , kill stop keepalived’, VRRP IP- HAProxy, IP- . , , . vip-manager. vip-manager , switchover, IP . , .





, , . Stolon :



  • ttl – .
  • loop_wait – Patroni-.
  • retry_timeout – DCS PostgreSQL.
  • master_start_timeout – PostgreSQL ( Patroni-).
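These parameters are not independent. The Patroni documentation recommends keeping `loop_wait + 2 * retry_timeout <= ttl`, so that a healthy leader can always refresh its key before it expires. A small sketch of that check (the function name is hypothetical; values are in seconds):

```shell
#!/usr/bin/env bash
# Sketch: check the Patroni timing rule loop_wait + 2 * retry_timeout <= ttl.

check_patroni_timing() {
  local ttl="$1" loop_wait="$2" retry_timeout="$3"
  if [ $(( loop_wait + 2 * retry_timeout )) -le "$ttl" ]; then
    echo "ok"
  else
    echo "ttl too small"
    return 1
  fi
}
```

With the Patroni defaults, `check_patroni_timing 30 10 10` prints `ok`; lowering `ttl` to 20 without touching the other two values fails the check.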


, , . , Patroni- Postgres, DCS. - loop_wait. , .



failover Patroni?





  • , DCS . Patroni- . . – 20-30 .

  • Patroni- REST API, endpoint Patroni WAL-. - 2 . 2 , , . , , WAL-.




  • DCS. - .




  • .




DCS , - , , 5 .





, , .



, , Patroni-. - - .





https://www.postgresql.org/message-id/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C@yandex-team.ru

https://github.com/zalando/patroni/blob/master/docs/watchdog.rst



.



  • . , , , Postgres. , Postgres WAL-commit .
  • Zalando – watchdog. Patroni- - , : , .
  • HAProxy Confd, . . , .
  • Corosync & Pacemaker — ( ) , . . . , , , .




, HAProxy Confd .





, netsplit? HAProxy Patroni . . health check’ Patroni-.



Confd. Confd , DCS.





, HAProxy PgBouncer. PgBouncer DCS. , , Patroni .





  • , Patroni . . , , - DCS . downtime , . wal_keep_segments, .
  • , , . . . , , , . .
  • Patroni? Patroni Stolon , enterprise . :
  • . .
  • . , , . , , failover , - . Max Availability Oracle Data Guard.
  • PostgreSQL Stolon.


!



.



Etcd, . , - ?



-, . , , Etcd, Consul mail , . fsync , .



-? , . , , , , ?



, -.



, Postgres , . DCS , .



, .



? ? . , , Consul . Etcd . - ?



Consul Etcd. RAFT. fsync . Postgres DCS , , . , . . , .



, !



Zookeeper? , ? ?



Zookeeper , . Etcd . Stolon , Patroni – .



- Patroni? . - ?



. wal_keep_segments, . . WAL- , . , issue Patroni. , Stolon, , , - .



! -! , . Patroni , . , , .



, . . .



. . . , . WAL-, . , WALs . . , . !



. , - -. , . . . switchover failover, . promote checkpoint, WALs . . , , .



! , - - . , ?



enterprise, Patroni. Stolon. , . . -- Kubernetes, , . Keeper, Sentinel. , , .



Patroni . WAL-, , DCS. DCS . (, , ), DCS, . . issue, Consul . Patroni. Stolon . Kubernetes.



. , Stolon ?



.



– master-slave Stolon.



, . , standby . – Stolon, . , , standby .



. . .



, ?



, . . , , , .



, .



. . ?



, .



Patroni . , , , . , . .



, Patroni , . Stolon , Postgres keeper data, .



, Stolon ?



open source, .



, , ?



, issue. .



- , . - . , .



issue. , , , .



-, . .



! . . , HAProxy, . . . . HAProxy "on-marked-down shutdown-sessions", , .



, ? health checks?



, http check REST API.
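To show how the HTTP health check and `on-marked-down shutdown-sessions` fit together, here is a sketch that prints an HAProxy `listen` section modelled on the example haproxy.cfg in the Patroni repository. The section name, node names, and the PostgreSQL port are assumptions; 8008 is the default Patroni REST API port, and the check relies on the REST API root returning 200 only on the leader.

```shell
#!/usr/bin/env bash
# Sketch: print an HAProxy section that routes clients to the current Patroni leader.

render_haproxy_listen() {
  local listen_port="$1"; shift
  cat <<EOF
listen postgres_primary
    bind *:${listen_port}
    mode tcp
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
EOF
  local node
  for node in "$@"; do
    # Traffic goes to port 5432, while the health check goes to the Patroni REST API.
    echo "    server ${node} ${node}:5432 maxconn 100 check port 8008"
  done
}
```

`render_haproxy_listen 5000 node1 node2 node3` prints a section with one `server` line per node. With `shutdown-sessions`, existing client connections are closed as soon as a node fails its check, so clients do not keep writing to a demoted primary.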



, -, HAProxy, IP- . – PgBouncer, health checks. HAProxy – , health checks , . , , – Patroni, - .



Patroni Etcd REST API.



, Etcd , Etcd.



Etcd? , , . watchdog, Patroni , , , watchdog reboot.



, watchdog – . watchdog. Patroni PostgreSQL, Patroni. watchdog – , , . .



, .



watchdog -, .. , , Patroni- , reboot. .



watchdog , , , , failover , . .



, . ? .



, …



, Patroni, . . - . watchdog – .



Patroni Etcd , , standby. , watchdog .



. , Patroni , , , . . : watchdog, HAProxy.



.



Etcd. ?



-.



- ?



-.



? , ?



-.



. . , , ?



Yes. I mentioned this in the point about an unstable configuration. Timeouts are the only way, specifically heartbeat_interval and election_timeout.



Thank you!





