What happens when one of your Kubernetes nodes fails?
This section details what happens during a node failure and what is expected during the recovery.
About one minute after the node fails, kubectl get nodes reports the node in the NotReady state.
After about five minutes, the state of every pod running on the NotReady node changes to either Unknown or NodeLost.
This delay comes from the pod eviction timeout setting, which defaults to five minutes.
Regardless of the workload type (StatefulSet or Deployment), Kubernetes automatically evicts the pods on the failed node and then tries to create replacement pods that reuse the old volumes.
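The states described above can be observed from the command line. The following is a minimal sketch, assuming kubectl is configured for the affected cluster; the function names not_ready_nodes and pods_on_node are hypothetical helpers introduced here for illustration.

```shell
# Hypothetical helper: print the names of nodes whose STATUS column starts with NotReady.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 ~ /^NotReady/ {print $1}'
}

# Hypothetical helper: list pods in all namespaces that are scheduled on the
# node given as the first argument, so their states can be checked.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}
```

For example, after loading these functions into a shell, run not_ready_nodes and then pods_on_node worker-1, where worker-1 is a placeholder for the failed node's name.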
If the node comes back online within 5 to 6 minutes of the failure, Kubernetes restarts the pods, then unmounts and re-mounts the volumes.
If an evicted pod gets stuck in the Terminating state and its attached volumes cannot be released or reused, the newly created pods get stuck in the ContainerCreating state. There are two options at this point:
Forcefully delete the stuck pods manually, or
Wait about another six minutes for Kubernetes to delete the VolumeAttachment objects associated with the pod, detach the volume from the lost node, and allow the new pods to use it.
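Both options can be sketched with kubectl, under the assumption that it is configured for the affected cluster. The function names below are hypothetical helpers, not part of any tool. Note that a forced deletion removes the pod object without waiting for confirmation from the kubelet on the lost node, so it should be used only when the node is known to be down.

```shell
# Hypothetical helper: print "namespace name" for every pod stuck in Terminating.
terminating_pods() {
  kubectl get pods --all-namespaces --no-headers | awk '$4 == "Terminating" {print $1, $2}'
}

# Hypothetical helper for the first option: force-delete one stuck pod.
# Arguments: namespace, pod name.
force_delete_pod() {
  kubectl delete pod "$2" --namespace "$1" --grace-period=0 --force
}

# Hypothetical helper for the second option: list the VolumeAttachment objects
# still bound to the node given as the first argument, to watch their cleanup.
volume_attachments_on_node() {
  kubectl get volumeattachments --no-headers \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName |
    awk -v node="$1" '$2 == node {print $1}'
}
```

A typical sequence would be to run terminating_pods, pass each reported namespace and name to force_delete_pod, and check volume_attachments_on_node with the lost node's name until it prints nothing.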
In summary, if the failed node is recovered later, Kubernetes restarts the terminating pods, detaches the volumes, waits for the old VolumeAttachment objects to be cleaned up, and then reuses (re-attaches and re-mounts) the volumes. These steps typically take about 1 to 7 minutes.