
K8s Troubleshooting

... is attempting to grant RBAC permissions not currently held

Error:

Error from server (Forbidden): clusterroles.rbac.authorization.k8s.io "foo-cluster-role" is forbidden: user "[email protected]" (groups=["bar"]) is attempting to grant RBAC permissions not currently held:
{APIGroups:[""], Resources:["nodes"], Verbs:["list"]}

Solution: use kubectl patch to add the missing permission

$ kubectl patch clusterrole foo-cluster-role \
  --kubeconfig ${KUBECONFIG} \
  --type='json' \
  -p='[{"op": "add", "path": "/rules/0", "value":{ "apiGroups": [""], "resources": ["nodes"], "verbs": ["list"]}}]'
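After patching, the grant can be verified; a hedged sketch, reusing `foo-cluster-role` and the user from the error message above as placeholders:

```shell
# Confirm the new rule landed in the ClusterRole
kubectl get clusterrole foo-cluster-role -o jsonpath='{.rules[0]}'

# Check whether the affected user can now list nodes
kubectl auth can-i list nodes [email protected]
```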

If kubectl patch fails because the current user does not have the permission (and therefore cannot grant it to this ClusterRole), check your kubeconfig: if there is another context with higher permissions, switch to it:

$ kubectl config use-context admin-context

Then patch again.

Object stuck in Terminating Status

Check the finalizers of the object. An object will not be removed until its metadata.finalizers field is empty.

The target object remains in a terminating state while the control plane, or other components, take the actions defined by the finalizers.

https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/
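For example, the finalizers can be inspected and, as a last resort, cleared by hand (`foo`/`my-object` are placeholders; removing finalizers skips whatever cleanup they guard, so prefer letting the responsible controller finish first):

```shell
# Show the finalizers blocking deletion
kubectl get foo my-object -o jsonpath='{.metadata.finalizers}'

# Last resort: clear the finalizers so the object can be deleted
kubectl patch foo my-object --type='json' \
  -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
```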

message: 'The node was low on resource: ephemeral-storage.

Error

Pods are failing:

"message: 'The node was low on resource: ephemeral-storage."

Debug

Check disk usage

$ df -h

If the disk is indeed full, check what is taking up the disk space in /var/lib/kubelet or /var/log.
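A sketch of that check, run on the affected node (the paths are the common defaults; adjust to your node's configuration):

```shell
# Show the largest entries one level under the usual suspects,
# staying on the current filesystem (-x)
du -xh -d1 /var/lib/kubelet /var/log 2>/dev/null | sort -h | tail -n 10
```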

no kind is registered for the type ... in scheme ...

Register the type into the scheme with AddToScheme():

import (
	foov1 "path/to/foo/v1"

	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

var scheme = runtime.NewScheme()

func init() {
	// Must panics on error, surfacing registration failures at startup.
	utilruntime.Must(foov1.AddToScheme(scheme))
}

"timed out waiting for cache to be synced"

This usually means the informer's initial list/watch cannot complete: the CRD for the watched type may not be installed, or the controller's RBAC may be missing list/watch permissions on the resource.
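A quick way to check both, assuming a hypothetical `foos.example.com` CRD and a controller service account `foo-controller` in namespace `foo-system`:

```shell
# Is the CRD installed?
kubectl get crd foos.example.com

# Can the controller's service account list and watch the resource?
kubectl auth can-i list foos.example.com \
  --as=system:serviceaccount:foo-system:foo-controller
kubectl auth can-i watch foos.example.com \
  --as=system:serviceaccount:foo-system:foo-controller
```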

failed to call webhook: the server could not find the requested resource

  • Check your ValidatingWebhookConfiguration objects.
  • Check the Service of the webhook.
  • Check the Deployment of the webhook backend, see if it is up and running, and if it is busy dealing with something else.
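The checklist above can be run as follows, assuming hypothetical names `foo-webhook` in namespace `foo-system`:

```shell
# Does the webhook configuration point at an existing service?
kubectl get validatingwebhookconfigurations -o wide

# Does the Service exist and have ready endpoints?
kubectl get svc,endpoints -n foo-system foo-webhook

# Is the backend Deployment up, and what is it busy with?
kubectl get deploy -n foo-system foo-webhook
kubectl logs -n foo-system deploy/foo-webhook --tail=50
```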

Pod cannot be scheduled

Possible causes:

  • The cluster does not have enough CPU or RAM available to meet the Pod's requirements.
  • "1 node(s) didn't have free ports for the requested pod ports."
  • Pod affinity or anti-affinity rules prevent it from being placed on the available nodes.
  • Nodes are cordoned due to updates or restarts.
  • The Pod requires a persistent volume that is unavailable, or bound to an unavailable node.
  • RBAC errors on the scheduler itself, e.g. 'User "system:kube-scheduler" cannot list resource "pods" in API group "" at the cluster scope'.

How to troubleshoot:

  • Check the status of the Pod.
  • Check the status of the Node.
  • Check the logs of kube-scheduler.
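The three checks above might look like this (pod, node, and namespace names are placeholders; the scheduler label matches kubeadm-style clusters):

```shell
# Pod: the Events section usually states why scheduling failed
kubectl describe pod my-pod -n my-namespace

# Nodes: look for cordons (SchedulingDisabled) and pressure conditions
kubectl get nodes
kubectl describe node my-node

# Scheduler logs
kubectl logs -n kube-system -l component=kube-scheduler --tail=100
```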

Pod takes a long time to shutdown / SIGTERM is not properly handled

If the container uses /bin/sh -c ./startup.sh as its command, the shell process does not forward or handle the SIGTERM it receives when the Pod is asked to shut down. Kubernetes sends SIGTERM, waits out the termination grace period (20 minutes in this case), and only then sends SIGKILL. In the meantime, the shell process is oblivious and doesn't know it should shut down.

To fix this, one way is to use Tini (https://github.com/krallin/tini):

With Tini, SIGTERM properly terminates your process even if you didn't explicitly install a signal handler for it.

For example, if you use Bazel to build the container image:

container_image(
    name = "docker_image",
    cmd = [
        "/bin/sh",
        "-c",
        "/startup.sh",
    ],
    # ...
)

Replace cmd with

cmd = [
    "/usr/bin/tini",
    "/startup.sh",
],
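Another option, if you'd rather not add Tini, is to have the script exec the real process, so the shell is replaced and SIGTERM reaches the application directly (a sketch; /app/server is a placeholder):

```shell
#!/bin/sh
# startup.sh: exec replaces the shell with the server process,
# so the application itself receives SIGTERM on shutdown.
exec /app/server "$@"
```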