We have a k8s cluster set up, but noticed that a few of our pods were in Error, Terminating, or ContainerCreating states. How do we figure out what caused these errors, and how do we correct them so that every pod reports a status of Running?
We are running this on k8s version 1.17:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-13T18:07:54Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.1", GitCommit:"d224476cd0730baca2b6e357d144171ed74192d6", GitTreeState:"clean", BuildDate:"2020-01-14T20:56:50Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Current Status
Let’s see what our current status is…
$ kubectl get pods -A
NAMESPACE              NAME                                                 READY   STATUS              RESTARTS   AGE
default                hello-world-77b74d7cc8-f6p69                         0/1     Error               0          39d
default                hello-world-77b74d7cc8-fr6xp                         0/1     Error               0          39d
kube-system            contrail-agent-7cl7b                                 0/2     Terminating         14         28d
kube-system            contrail-kube-manager-btmt9                          0/1     Terminating         10         28d
kube-system            coredns-6955765f44-2d26k                             0/1     Error               5          39d
kube-system            coredns-6955765f44-6q7c7                             0/1     Error               3          39d
kube-system            etcd-kubernetes-cluster-master                       1/1     Running             20         39d
kube-system            kube-apiserver-kubernetes-cluster-master             1/1     Running             24         39d
kube-system            kube-controller-manager-kubernetes-cluster-master    1/1     Running             18         39d
kube-system            kube-flannel-ds-amd64-29wf6                          1/1     Running             4          39d
kube-system            kube-flannel-ds-amd64-6845b                          1/1     Running             2          39d
kube-system            kube-flannel-ds-amd64-v6wpq                          1/1     Running             1          39d
kube-system            kube-proxy-cmgxr                                     1/1     Running             1          39d
kube-system            kube-proxy-qrnlg                                     1/1     Running             1          39d
kube-system            kube-proxy-zp2t2                                     1/1     Running             4          39d
kube-system            kube-scheduler-kubernetes-cluster-master             1/1     Running             20         39d
kube-system            metrics-server-694db48df9-46cgs                      0/1     ContainerCreating   0          18d
kubernetes-dashboard   dashboard-metrics-scraper-76585494d8-th7f6           0/1     Error               0          39d
kubernetes-dashboard   kubernetes-dashboard-5996555fd8-n9v2q                0/1     Error               28         39d
olm                    catalog-operator-64b6b59c4f-7qkpj                    0/1     ContainerCreating   0          41m
olm                    olm-operator-844fb69f58-hfdjk                        0/1     ContainerCreating   0          41m
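As an aside (this filter was not part of the original run, but is standard kubectl behaviour), you can narrow the listing down to the problem pods by filtering on pod phase. Note that Terminating pods may still report a phase of Running (Terminating is derived from a deletion timestamp, not a phase), so this is only a rough cut:
$ kubectl get pods -A --field-selector=status.phase!=Running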
Determine the Reason for Pod Failure
So we did a search for how to find the reason for a pod failure, and the Kubernetes docs ("Determine the Reason for Pod Failure") provided the answer.
Get info about a pod (remember to set the namespace):
$ kubectl get pod contrail-agent-7cl7b -n kube-system
NAME                   READY   STATUS        RESTARTS   AGE
contrail-agent-7cl7b   0/2     Terminating   14         28d
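It can also help to look at the pod's events (our addition, not shown in the original run). The Events section at the bottom of the describe output often names the immediate cause: failed volume mounts, image pull errors, failed probes, and so on:
$ kubectl describe pod contrail-agent-7cl7b -n kube-system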
Get more detailed info:
$ kubectl get pod contrail-agent-7cl7b -n kube-system --output=yaml
...
and check the status:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-01-28T15:01:22Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-01-28T15:50:13Z"
    message: 'containers with unready status: [contrail-vrouter-agent contrail-agent-nodemgr]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-01-28T15:50:13Z"
    message: 'containers with unready status: [contrail-vrouter-agent contrail-agent-nodemgr]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-01-28T14:46:11Z"
    status: "True"
    type: PodScheduled
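If you only want the conditions rather than the whole object, a jsonpath query works too (our addition; kubectl has supported jsonpath output for a long time, including 1.17):
$ kubectl get pod contrail-agent-7cl7b -n kube-system \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'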
Next, check lastState to see if there is an error message, and also check the exit code of each container:
containerStatuses:
- containerID: docker://10d2e3518108cd47f36d812e2c33fa62d89743b4482dae77fd2e87fc09536140
  image: docker.io/opencontrailnightly/contrail-nodemgr:latest
  imageID: docker-pullable://docker.io/opencontrailnightly/contrail-nodemgr@sha256:3a73ee7cc262fe0f24996b7f910f7c135a143f3a94874bf9ce8c125ae26368d3
  lastState: {}
  name: contrail-agent-nodemgr
  ready: false
  restartCount: 1
  started: false
  state:
    terminated:
      exitCode: 0
      finishedAt: null
      startedAt: null
Unfortunately there is not much for us here: lastState is empty and the container terminated with exit code 0, so no error is recorded.
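For a quick summary across all containers in the pod, the names and exit codes can be pulled directly (our addition, using the same jsonpath mechanism as above; the exit code field is empty for containers that are not in a terminated state):
$ kubectl get pod contrail-agent-7cl7b -n kube-system \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}: {.state.terminated.exitCode}{"\n"}{end}'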
So now let's try getting the logs for a container in the pod:
$ kubectl logs contrail-agent-7cl7b contrail-vrouter-agent -n kube-system
Error from server (BadRequest): container "contrail-vrouter-agent" in pod "contrail-agent-7cl7b" is terminated
Looks like you cannot get the logs from a terminated container.
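Before giving up, it is worth trying the --previous flag (our addition), which asks for the logs of the last terminated instance of the container. Given the restart count of 14 there have been earlier instances, though whether this works depends on whether the old container's logs are still present on the node:
$ kubectl logs contrail-agent-7cl7b -c contrail-vrouter-agent -n kube-system --previous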
This is the reason why an external log aggregator, such as Fluentd or Grafana Loki, is recommended (Prometheus covers metrics rather than logs): it ships logs off the node before the containers that produced them are gone.