We have a Kubernetes (k8s) cluster set up, but we noticed that a few of our pods were stuck in the Error, Terminating, or ContainerCreating states. How do we figure out what caused these errors, and how do we fix them so that every pod reaches the Running status?
We are running this on k8s version 1.17:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-13T18:07:54Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.1", GitCommit:"d224476cd0730baca2b6e357d144171ed74192d6", GitTreeState:"clean", BuildDate:"2020-01-14T20:56:50Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Current Status
Let’s see what our current status is…
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default hello-world-77b74d7cc8-f6p69 0/1 Error 0 39d
default hello-world-77b74d7cc8-fr6xp 0/1 Error 0 39d
kube-system contrail-agent-7cl7b 0/2 Terminating 14 28d
kube-system contrail-kube-manager-btmt9 0/1 Terminating 10 28d
kube-system coredns-6955765f44-2d26k 0/1 Error 5 39d
kube-system coredns-6955765f44-6q7c7 0/1 Error 3 39d
kube-system etcd-kubernetes-cluster-master 1/1 Running 20 39d
kube-system kube-apiserver-kubernetes-cluster-master 1/1 Running 24 39d
kube-system kube-controller-manager-kubernetes-cluster-master 1/1 Running 18 39d
kube-system kube-flannel-ds-amd64-29wf6 1/1 Running 4 39d
kube-system kube-flannel-ds-amd64-6845b 1/1 Running 2 39d
kube-system kube-flannel-ds-amd64-v6wpq 1/1 Running 1 39d
kube-system kube-proxy-cmgxr 1/1 Running 1 39d
kube-system kube-proxy-qrnlg 1/1 Running 1 39d
kube-system kube-proxy-zp2t2 1/1 Running 4 39d
kube-system kube-scheduler-kubernetes-cluster-master 1/1 Running 20 39d
kube-system metrics-server-694db48df9-46cgs 0/1 ContainerCreating 0 18d
kubernetes-dashboard dashboard-metrics-scraper-76585494d8-th7f6 0/1 Error 0 39d
kubernetes-dashboard kubernetes-dashboard-5996555fd8-n9v2q 0/1 Error 28 39d
olm catalog-operator-64b6b59c4f-7qkpj 0/1 ContainerCreating 0 41m
olm olm-operator-844fb69f58-hfdjk 0/1 ContainerCreating 0 41m
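With this many pods, it can help to list only the unhealthy ones first. One way to do that is a field selector on the pod phase (note that Succeeded pods, e.g. completed Jobs, would also match this filter):

```shell
# List only pods whose phase is not Running, across all namespaces.
kubectl get pods -A --field-selector=status.phase!=Running
```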
Determine the Reason for Pod Failure
A quick search for how to find the reason for a pod failure led us to the answer in the Kubernetes docs.
Get info about a pod (remember to set the namespace):
$ kubectl get pod contrail-agent-7cl7b -n kube-system
NAME READY STATUS RESTARTS AGE
contrail-agent-7cl7b 0/2 Terminating 14 28d
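Before diving into the raw YAML, `kubectl describe` is often the fastest route to the cause: the Events section at the bottom of its output usually names the problem directly (failed image pulls, failed volume mounts, CNI issues, and so on). Using the same pod as above:

```shell
# The Events section at the end of the output is usually the most telling part.
kubectl describe pod contrail-agent-7cl7b -n kube-system
```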
Get more detailed info:
$ kubectl get pod contrail-agent-7cl7b -n kube-system --output=yaml
...
and check the status:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-01-28T15:01:22Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-01-28T15:50:13Z"
    message: 'containers with unready status: [contrail-vrouter-agent contrail-agent-nodemgr]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-01-28T15:50:13Z"
    message: 'containers with unready status: [contrail-vrouter-agent contrail-agent-nodemgr]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-01-28T14:46:11Z"
    status: "True"
    type: PodScheduled
Check lastState to see if there is an error message, and also check the container's exit code:
  containerStatuses:
  - containerID: docker://10d2e3518108cd47f36d812e2c33fa62d89743b4482dae77fd2e87fc09536140
    image: docker.io/opencontrailnightly/contrail-nodemgr:latest
    imageID: docker-pullable://docker.io/opencontrailnightly/contrail-nodemgr@sha256:3a73ee7cc262fe0f24996b7f910f7c135a143f3a94874bf9ce8c125ae26368d3
    lastState: {}
    name: contrail-agent-nodemgr
    ready: false
    restartCount: 1
    started: false
    state:
      terminated:
        exitCode: 0
        finishedAt: null
        startedAt: null
Unfortunately there is not much for us here: lastState is empty and the container exited with code 0.
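Here the exit code was 0, but when a container exits with a non-zero code, the number itself narrows things down: codes above 128 mean the process was killed by a signal (code minus 128). A small sketch of the common conventions (this mapping is our own summary, not a kubectl feature):

```shell
# Map a container exit code to a likely cause (common Unix conventions).
explain_exit_code() {
  case "$1" in
    0)   echo "completed successfully (or never recorded an exit code)" ;;
    1)   echo "application error inside the container" ;;
    137) echo "SIGKILL (128+9) -- often OOMKilled or a forced delete" ;;
    143) echo "SIGTERM (128+15) -- normal graceful shutdown" ;;
    *)   echo "unrecognised -- check the container logs" ;;
  esac
}

explain_exit_code 137
```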
So now try getting the logs for the container in the pod:
$ kubectl logs contrail-agent-7cl7b contrail-vrouter-agent -n kube-system
Error from server (BadRequest): container "contrail-vrouter-agent" in pod "contrail-agent-7cl7b" is terminated
Looks like you cannot get the logs from a container that has already terminated, and once the pod object itself is deleted, its logs are gone with it. This is why an external log aggregator, such as Fluentd or an EFK (Elasticsearch, Fluentd, Kibana) stack, is recommended; Prometheus, for its part, collects metrics rather than logs.
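One caveat worth knowing: if a container crashed and was restarted (rather than the whole pod being torn down), kubectl can still fetch the previous instance's logs with the --previous flag:

```shell
# --previous (-p) returns logs from the last terminated instance of the
# container; these survive container restarts but not pod deletion.
kubectl logs contrail-agent-7cl7b -c contrail-vrouter-agent -n kube-system --previous
```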