How to speed up Kubernetes' reaction to cluster node failures?

Kubernetes is designed to be robust and fault tolerant, and it can recover automatically. And it does this well! However, production nodes can lose their connection to the cluster or fail for a variety of reasons. In those cases, it is essential that Kubernetes reacts to the incident quickly.

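To try this out, we create a Kubernetes cluster with Kind. The Kind Cluster configuration below shortens the interval at which the kubelet reports node status, as well as the period and grace period the Controller manager uses to monitor nodes.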

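kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# applied to every node: the kubelet reports its status every second
kubeadmConfigPatches:
- |
  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  nodeStatusUpdateFrequency: 1s
nodes:
- role: control-plane
  # controller manager checks nodes every 1s and marks them NotReady after 4s
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    controllerManager:
        extraArgs:
          node-monitor-period: 1s
          node-monitor-grace-period: 4s
- role: worker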

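We then create an Nginx deployment with replicas on both the control-plane and the worker node. On the control-plane we also run an Ubuntu pod, which we use to check whether Nginx stays reachable when the worker goes down.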


# create a K8S cluster with Kind
kind create cluster --config kind.yaml 
# create an Ubuntu pod on the control-plane node
kubectl run ubuntu --wait=true --image ubuntu --overrides='{"spec": { "nodeName": "kind-control-plane"}}' sleep 30d
# untaint control-plane node in order to schedule pods on it
# (on newer Kubernetes versions the taint key is node-role.kubernetes.io/control-plane)
kubectl taint node kind-control-plane node-role.kubernetes.io/master-
# create Nginx deployment with 2 replicas, one on each node
kubectl create deploy ng --image nginx
sleep 30
kubectl scale deployment ng --replicas 2
# expose Nginx deployment so that it is reachable on port 80
kubectl expose deploy ng --port 80 --type ClusterIP
# install curl in Ubuntu pod
kubectl exec ubuntu -- bash -c "apt update && apt install -y curl"
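
Before running the test, it may be worth confirming that the node-monitor settings from the Kind config actually reached the Controller manager. The sketch below is one way to do that: a small helper (the name `monitor_flags` is made up for this example) that picks the relevant flags out of a pod manifest read from stdin. The usage line assumes the kubeadm-style label `component=kube-controller-manager` on the static pod.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every --node-monitor-* flag found on stdin.
monitor_flags() {
  # "--" ends grep's option parsing, since the pattern itself starts with dashes
  grep -o -- '--node-monitor-[a-z-]*=[0-9a-z]*'
}

# Usage (assumes the kubeadm label on the controller-manager pod):
#   kubectl -n kube-system get pod -l component=kube-controller-manager -o yaml | monitor_flags
```

If the patch was applied, both `--node-monitor-period` and `--node-monitor-grace-period` should appear with the values from kind.yaml.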

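To check access to Nginx, we call the service with curl from the Ubuntu pod running on the control-plane, and at the same time we watch the endpoints that belong to the Nginx service.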

# test Nginx service access from Ubuntu pod
kubectl exec ubuntu -- bash -c 'while true ; do echo "$(date +"%T.%3N") - Status: $(curl -s -o /dev/null -w "%{http_code}" -m 0.2 -i ng)" ; done'

# show Nginx service endpoints
# (gdate is GNU date as installed by coreutils on macOS; on Linux use date)
while true; do gdate +"%T.%3N"; kubectl get endpoints ng -o json | jq '.subsets[].addresses[].nodeName'; echo "------"; done


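Finally, to simulate a node failure, we stop the Kind container that runs the worker node. We record the time at which the node was stopped and the time at which it was marked NotReady.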


# kill Kind worker node
echo "Worker down at $(gdate +"%T.%3N")"
docker stop kind-worker > /dev/null
sleep 15
# show when the node was detected to be down
echo "Worker detected in down state by Control Plane at "
kubectl get event --field-selector reason=NodeNotReady --sort-by='.lastTimestamp' -oyaml | grep time | tail -n1
# start worker node again
docker start kind-worker > /dev/null


In this run, the worker went down at 12:50:22, and the Controller manager detected it at 12:50:26, about 4 seconds later.

Worker down at 12:50:22.285
Worker detected in down state by Control Plane at
      time: "12:50:26Z"
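
The delay between the two timestamps can also be computed rather than read off by eye. This is a minimal sketch; the function names `to_seconds` and `detection_delay` are invented for the example, and it assumes both timestamps fall on the same day and are expressed in the same time zone (the event time is reported in UTC, while `gdate` prints local time).

```shell
#!/usr/bin/env bash
# Convert HH:MM:SS[.mmm][Z] to whole seconds since midnight.
to_seconds() {
  local t=${1%%.*}   # drop fractional seconds, if any
  t=${t%Z}           # drop a trailing "Z", if any
  local h m s
  IFS=: read -r h m s <<< "$t"
  # 10# forces base 10 so values such as "08" are not parsed as octal
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Print the whole seconds elapsed between two timestamps (start, end).
detection_delay() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Usage with the values from the run above:
#   detection_delay 12:50:22.285 12:50:26Z
```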

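The curl test tells the same story. From 12:50:23 some requests start to fail, because part of the traffic is still sent to the stopped worker. At 12:50:26.744 Kube Proxy removes the endpoint, and after that all requests succeed again.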

12:50:23.115 - Status: 200
12:50:23.141 - Status: 200
12:50:23.161 - Status: 200
12:50:23.190 - Status: 000
12:50:23.245 - Status: 200
12:50:23.269 - Status: 200
12:50:23.291 - Status: 000
12:50:23.503 - Status: 200
12:50:23.520 - Status: 000
12:50:23.738 - Status: 000
12:50:23.954 - Status: 000
12:50:24.166 - Status: 000
12:50:24.385 - Status: 200
12:50:24.407 - Status: 000
12:50:24.623 - Status: 000
12:50:24.839 - Status: 000
12:50:25.053 - Status: 000
12:50:25.276 - Status: 200
12:50:25.294 - Status: 000
12:50:25.509 - Status: 200
12:50:25.525 - Status: 200
12:50:25.541 - Status: 200
12:50:25.556 - Status: 200
12:50:25.575 - Status: 000
12:50:25.793 - Status: 200
12:50:25.809 - Status: 200
12:50:25.826 - Status: 200
12:50:25.847 - Status: 200
12:50:25.867 - Status: 200
12:50:25.890 - Status: 000
12:50:26.110 - Status: 000
12:50:26.325 - Status: 000
12:50:26.549 - Status: 000
12:50:26.604 - Status: 200
12:50:26.669 - Status: 000
12:50:27.108 - Status: 200
12:50:27.135 - Status: 200
12:50:27.162 - Status: 200
12:50:27.188 - Status: 200
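
For longer logs, the failure window can be extracted mechanically. The sketch below assumes the exact line format produced by the curl loop (`<time> - Status: <code>`) and treats status `000` as a failed request; `failure_window` is a name made up for this example.

```shell
#!/usr/bin/env bash
# Read curl-loop output on stdin and print:
#   <first failed timestamp> <last failed timestamp> <number of failures>
# Prints nothing if no request failed.
failure_window() {
  awk '$NF == "000" { if (first == "") first = $1; last = $1; n++ }
       END { if (n > 0) printf "%s %s %d\n", first, last, n }'
}

# Usage: save the output of the curl loop to a file, then
#   failure_window < curl.log
```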

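So it is possible to make Kubernetes react faster. Keep in mind, however, that with these settings every kubelet reports its status to Kubernetes, and therefore to etcd, every 1 second. With 1000 nodes, for example, that adds up to 60000 updates per minute, which puts considerable extra load on etcd.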
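
The load estimate is simple arithmetic and easy to redo for other cluster sizes. This is a rough sketch that assumes one status write per node per interval and ignores every other source of etcd traffic; the function name is made up for the example.

```shell
#!/usr/bin/env bash
# Estimate node status updates per minute.
#   $1 = number of nodes
#   $2 = nodeStatusUpdateFrequency in whole seconds
status_updates_per_minute() {
  local nodes=$1 interval_s=$2
  echo $(( nodes * 60 / interval_s ))
}

# Usage:
#   status_updates_per_minute 1000 1    # the setting used in this article
#   status_updates_per_minute 1000 10   # 10s is the kubelet default
```

Comparing the two calls shows the trade-off: the 1s setting multiplies the number of status updates by ten relative to the default.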


