A Journey of Troubleshooting K8s

We have a k8s cluster set up, but noticed that several of our pods were stuck in the Error, Terminating or ContainerCreating states.

How do we figure out what caused these errors, and how do we correct them so that every pod's status is Running?

We are running this on k8s version 1.17:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-13T18:07:54Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.1", GitCommit:"d224476cd0730baca2b6e357d144171ed74192d6", GitTreeState:"clean", BuildDate:"2020-01-14T20:56:50Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}

Current Status

Let's see what our current status is...

$ kubectl get pods -A
NAMESPACE              NAME                                                READY   STATUS              RESTARTS   AGE
default                hello-world-77b74d7cc8-f6p69                        0/1     Error               0          39d
default                hello-world-77b74d7cc8-fr6xp                        0/1     Error               0          39d
kube-system            contrail-agent-7cl7b                                0/2     Terminating         14         28d
kube-system            contrail-kube-manager-btmt9                         0/1     Terminating         10         28d
kube-system            coredns-6955765f44-2d26k                            0/1     Error               5          39d
kube-system            coredns-6955765f44-6q7c7                            0/1     Error               3          39d
kube-system            etcd-kubernetes-cluster-master                      1/1     Running             20         39d
kube-system            kube-apiserver-kubernetes-cluster-master            1/1     Running             24         39d
kube-system            kube-controller-manager-kubernetes-cluster-master   1/1     Running             18         39d
kube-system            kube-flannel-ds-amd64-29wf6                         1/1     Running             4          39d
kube-system            kube-flannel-ds-amd64-6845b                         1/1     Running             2          39d
kube-system            kube-flannel-ds-amd64-v6wpq                         1/1     Running             1          39d
kube-system            kube-proxy-cmgxr                                    1/1     Running             1          39d
kube-system            kube-proxy-qrnlg                                    1/1     Running             1          39d
kube-system            kube-proxy-zp2t2                                    1/1     Running             4          39d
kube-system            kube-scheduler-kubernetes-cluster-master            1/1     Running             20         39d
kube-system            metrics-server-694db48df9-46cgs                     0/1     ContainerCreating   0          18d
kubernetes-dashboard   dashboard-metrics-scraper-76585494d8-th7f6          0/1     Error               0          39d
kubernetes-dashboard   kubernetes-dashboard-5996555fd8-n9v2q               0/1     Error               28         39d
olm                    catalog-operator-64b6b59c4f-7qkpj                   0/1     ContainerCreating   0          41m
olm                    olm-operator-844fb69f58-hfdjk                       0/1     ContainerCreating   0          41m
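
With a listing this long, it can help to narrow it down to the pods that need attention. Here is a minimal sketch, assuming the column layout shown above (STATUS is the fourth column when using -A). The function name not_running is made up for this post and is not part of kubectl.

```shell
# Hypothetical helper: keep the header row plus any pod whose STATUS
# column is neither Running nor Completed.
# Assumes the column layout of `kubectl get pods -A` (STATUS is field 4).
not_running() {
  awk 'NR == 1 || ($4 != "Running" && $4 != "Completed")'
}

# Usage (needs access to a cluster):
#   kubectl get pods -A | not_running
```

Filtering the text output keeps the displayed STATUS (Error, Terminating, ContainerCreating), which is what we are chasing here.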

Determine the Reason for Pod Failure

I searched for how to determine the reason for a pod failure, and the Kubernetes docs provided the answer.

Get info about a pod (remember to set the namespace):

$ kubectl get pod contrail-agent-7cl7b -n kube-system
NAME                   READY   STATUS        RESTARTS   AGE
contrail-agent-7cl7b   0/2     Terminating   14         28d

Get more detailed info:

$ kubectl get pod contrail-agent-7cl7b -n kube-system --output=yaml
...

and check the status:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-01-28T15:01:22Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-01-28T15:50:13Z"
    message: 'containers with unready status: [contrail-vrouter-agent contrail-agent-nodemgr]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-01-28T15:50:13Z"
    message: 'containers with unready status: [contrail-vrouter-agent contrail-agent-nodemgr]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-01-28T14:46:11Z"
    status: "True"
    type: PodScheduled

Check the lastState to see if there is an error message, and also check the exit code of the container:

  containerStatuses:
  - containerID: docker://10d2e3518108cd47f36d812e2c33fa62d89743b4482dae77fd2e87fc09536140
    image: docker.io/opencontrailnightly/contrail-nodemgr:latest
    imageID: docker-pullable://docker.io/opencontrailnightly/contrail-nodemgr@sha256:3a73ee7cc262fe0f24996b7f910f7c135a143f3a94874bf9ce8c125ae26368d3
    lastState: {}
    name: contrail-agent-nodemgr
    ready: false
    restartCount: 1
    started: false
    state:
      terminated:
        exitCode: 0
        finishedAt: null
        startedAt: null

Unfortunately there is not much for us here: lastState is empty and the exit code is 0.
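
Rather than scrolling through the full YAML, the container states can probably be pulled out directly with kubectl's jsonpath output. This is a rough sketch; the helper name container_states is made up for this post, and the exact formatting of the printed state depends on the kubectl version.

```shell
# Hypothetical helper: print each container's name and state for one pod.
# Arguments: pod name, namespace.
container_states() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
}

# Usage (needs access to a cluster):
#   container_states contrail-agent-7cl7b kube-system
```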

Next, try getting the logs for a container in the pod:

$ kubectl logs contrail-agent-7cl7b contrail-vrouter-agent -n kube-system
Error from server (BadRequest): container "contrail-vrouter-agent" in pod "contrail-agent-7cl7b" is terminated

It looks like the logs for this terminated container are no longer available through kubectl.

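This is why an external log aggregator is recommended, for example Fluentd shipping to Elasticsearch, or Loki. Note that Prometheus collects metrics rather than logs, so it would not have helped here.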
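
Before reaching for an external tool, a few other kubectl commands may still be worth a try. This is only a sketch: the helper name pod_clues is made up for this post, and on pods as old as ours the events have most likely expired already, so some of these commands may return nothing or fail.

```shell
# Hypothetical helper: gather what the API server still knows about a pod.
# Arguments: pod name, namespace, container name.
pod_clues() {
  # The Events section at the end of the describe output often names the failing step.
  kubectl describe pod "$1" -n "$2"
  # Events recorded against this pod (kept only for a limited time).
  kubectl get events -n "$2" --field-selector "involvedObject.name=$1"
  # Logs from the previous instance of the container, if the node still has them.
  kubectl logs "$1" -c "$3" -n "$2" --previous
}

# Usage (needs access to a cluster):
#   pod_clues contrail-agent-7cl7b kube-system contrail-vrouter-agent
```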