K8s Kubernetes incubating

Architecture

graph LR;
c1(Cluster) --> n1(node1) --> p11[pod 1]
n1 --> p12[pod 2]
c1 --> n2(node2) --> p21[pod 1]

API docs - https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/

Insert diagram about control nodes + worker nodes + api servers etc.

k8s-arch

#refine https://kubernetes.io/docs/concepts/architecture/

Components

Metric Server

Use HostNetwork

Helps monitor pods and nodes. Run kubectl top pod to check resource usage for pods, and kubectl top nodes for nodes.

CNCF

Project status

graph LR;
a(Sandbox
New) --> b(Incubating
More wide-spread adoption, active development) --> c[Graduated
mature, stable part of k8s core]

https://www.cncf.io/projects

Kubernetes Plugins?

CRI-O

Kubernetes container runtime #readmore

CNI

Container Network Interface

Jaeger

Distributed tracing. The Jaeger Operator is a Kubernetes Operator that manages packaging, deploying and managing Jaeger on a cluster.

Rook

Storage Orchestrator

Cluster Autoscaler

https://github.com/kubernetes/autoscaler

Kubernetes Distributions

Rancher

Red Hat OpenShift

SUSE Containers as a Service

Kubernetes Managed Services

AWS Elastic Kubernetes Service

Azure Kubernetes Service

Google Kubernetes Engine

Certifications

CKAD

Developers

CKA

Admins

Kubernetes Dashboard

Link

Kubernetes Database

etcd is essentially a key-value store

all k8s resources are stored in etcd in JSON format

json is not very human friendly, so yaml is the de-facto choice for k8s config files, which are called manifests.

A manifest file broadly contains -

apiVersion:     # v1, v1beta1, v1beta2 etc.
kind:           # pod, deployment, secret, configmap etc.
metadata:
  annotations:  # used for configurations sometimes
  labels:
  name:         # name of the object
  resourceVersion: # value changes with each update
data: # found in secret, and configmap objects
spec: # configs, varies by object, absent for some like secret, configmaps; selectors appear here, not under metadata
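Manifests are applied with kubectl; a few commonly used invocations (a sketch, the file name is illustrative):

# create/update the resources defined in a manifest
kubectl apply -f manifest.yaml

# render what would be sent to the API server without persisting anything
kubectl apply -f manifest.yaml --dry-run=client -o yaml

# show a diff against the live object before applying
kubectl diff -f manifest.yaml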

kubectl

Config file ~/.kube/config

Structure of the config file, and the values that need to be specified -

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: xxxx
    server: https://172.22.28.5:6443
  name: kubernetes
contexts: # combination of cluster, username and namespace
- context:
    cluster: kubernetes
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: xxxx
    client-key-data: xxxx

If this file is not present or has invalid details of a cluster, you might see an error like

$ kubectl get all
E1223 11:43:00.538822   14558 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused

This file is usually generated with the help of the /etc/kubernetes/admin.conf file from the control node. The user in that file is the cluster admin (kubernetes-admin).
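A few kubectl config subcommands for inspecting and switching what this file points at (context name taken from the example above):

kubectl config view --minify      # show config for the current context only
kubectl config get-contexts       # list available contexts
kubectl config current-context    # print the active context
kubectl config use-context kubernetes-admin@kubernetes   # switch context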

Commands

kubectl create deployment can specify replicas (an earlier note saying it couldn't was incorrect): kubectl create deployment my-dep --image=nginx --replicas=3

kubectl explain pod shows all the fields available to configure a pod. To deep dive into a particular property, use kubectl explain pod.spec

To find out specific fields to specify for configuring [[K8S Scheduling#Node Affinity|nodeAffinity]], use kubectl explain pod.spec.affinity.nodeAffinity

Generate yaml from existing resources, use kubectl get <resource> -o yaml

Remember to cleanup the output (metadata, and status) as these should be added automatically when the resource is created
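Alternatively, a clean manifest can be generated without touching a live object by using --dry-run=client (names are illustrative):

kubectl create deployment my-dep --image=nginx --replicas=3 --dry-run=client -o yaml > my-dep.yaml
kubectl run busybox --image=busybox --dry-run=client -o yaml > busybox-pod.yaml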

kubectl delete pod/podname --grace-period=0 --force to delete pod immediately

Troubleshooting and Debugging

When creating a pod, kubernetes first adds it to the etcd store.

kubectl describe can highlight problems if there is an issue during this initial step. Once the pod is added to etcd, it’s then started up.

In the output of kubectl describe pod <podname>, Containers[].State shows the current state of each container in the pod

Containers:
  busybox:
    Container ID:   containerd://
    Image:          busybox
    Image ID:       docker.io/library/busybox@sha256:ver
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed  # this can hint at a problem where the container has exited after completing its task
      Exit Code:    0
      Started:      Sun, 10 Dec 2023 00:06:04 +0000
      Finished:     Sun, 10 Dec 2023 00:06:04 +0000
    Ready:          False
    Restart Count:  7
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sfpsw (ro)

The Events section shows errors such as CrashLoopBackOff. Latest events are at the bottom.

Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------  
  ... 
  Warning  BackOff    3m24s (x48 over 13m)  kubelet            Back-off restarting failed container busybox in pod mydep-8677c6d8bd-8c2c6_default(72a66cb8-e29a-4be3-ac99-8f88565327f3)

Once the pod is running, in addition to kubectl describe, more information can be found using the following commands.

kubectl get pods - high level view of pods in a namespace

kubectl get pod/pod-id -o yaml is another way of getting similar info as kubectl describe; the interesting fields to watch are status.conditions and status.containerStatuses

Restart a pod?

kubectl scale
kubectl rollout restart deployment name
kubectl delete pod name
kubectl get pod name -o yaml | kubectl replace --force -f -

To find problems when a container/pod is running, use the following commands -

kubectl logs podname --all-containers get logs from all containers in pod podname

kubectl logs podname -c container get logs from container in pod podname

kubectl exec -it podname -- /bin/sh get a session into the container, if the container has a shell

Even if a container has a shell, you will find many of the regular utilities missing since images are usually optimized for runtime.

The [[Linux#Proc|proc]] file system can still help in such a case to find running processes etc.
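A rough sketch of leaning on /proc inside a minimal container when ps and friends are missing:

kubectl exec -it podname -- /bin/sh
# inside the container:
ls -d /proc/[0-9]*          # one directory per running process
cat /proc/1/cmdline; echo   # command line of PID 1 (NUL separated)
head /proc/1/status         # name, state, memory of PID 1
ls -l /proc/1/fd            # open file descriptors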

Helpful debugging kubectl commands for most objects -

  • describe - Show details of a specific resource or group of resources
  • logs - Print the logs for a container in a pod
  • events - List events
  • attach - Attach to a running container
  • exec - Execute a command in a container
  • cp - Copy files and directories to and from containers
  • port-forward - Forward one or more local ports to a pod
  • proxy - Run a proxy to the Kubernetes API server
  • auth - Inspect authorization
  • debug - Create debugging sessions for troubleshooting workloads and nodes

Problems with Nodes

kubectl get nodes shows which nodes are available and in a ready state

kubectl cordon - Use to mark node(s) unschedulable, can use selector. Use uncordon once the maintenance is done.

kubectl drain - Prepare node for maintenance by removing running pods gracefully and marking it unschedulable for new pods.

The behaviour differs based on how the pod is started on the node -

  • If controlled by a DaemonSet, the pods are ignored, since the DaemonSet controller ignores the unschedulable node state.
  • If controlled by a Deployment, ReplicaSet, StatefulSet, Job, or ReplicationController, drain will either evict the pods (if eviction is supported by the API server) or delete them.
  • Standalone pods won’t be deleted or evicted unless the --force flag is specified.

If a node is NotReady,

  • Check if kubelet is running on a node.
  • Check networking plugin is setup properly and running

Q: How to port forward to local, when running kubectl in docker?
A: start the kubectl container on docker, and expose a port

docker run -it --name kubectl -p 8000:8000 kubectl:latest

now run port forward as normal, but listen on 0.0.0.0 in addition to localhost.

kubectl -n workload port-forward svc/workload --address localhost,0.0.0.0 8000:8000

Kubernetes Objects

Diagram

![[k8s-objects.png]]

Deployments

Adds scalability, high availability, self healing capabilities to a pod by defining replication strategy and update strategy

kubectl create deployment my-dep --image=nginx --replicas=3

Example declaration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

kubectl rollout history deployments provides recent rollout events including reason for change (scale out/in not included)

kubectl rollout history deployment/my-app

Rollback a failed deployment to previous version kubectl rollout undo deployment/my-app --to-revision=1
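Related rollout commands that are handy while a deployment is updating (deployment name is illustrative):

kubectl rollout status deployment/my-app                  # wait for the rollout to finish
kubectl rollout history deployment/my-app --revision=2    # details of a specific revision
kubectl rollout pause deployment/my-app                   # pause a rollout
kubectl rollout resume deployment/my-app                  # resume it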

Update Strategy

Specify RollingUpdate or Recreate (Recreate can cause temporary disruption)

Deployment     rollingUpdate                                          recreate
Note           deploys new replicaset, then removes old replicaset
Disruption     no                                                     yes
Useful for     add examples                                           add examples
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate # or Recreate
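A sketch of the rollingUpdate parameters that control how a rolling update proceeds (values are illustrative):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1           # extra pods allowed above the desired replica count during the update
      maxUnavailable: 0     # pods that may be unavailable during the update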

ReplicaSet

ReplicaSets use labels to monitor pods. If you remove a matching label from a pod, another pod comes up within seconds to replace it; compare the pod whose app label was removed with the 4s-old replacement in the output below.

root@controlplane:~$ kubectl get pods --show-labels
NAME                     READY   STATUS    RESTARTS   AGE     LABELS
my-dep-7674c564c-9t2wk   1/1     Running   0          6m39s   pod-template-hash=7674c564c,test=worksok
my-dep-7674c564c-gxzb7   1/1     Running   0          6m39s   app=my-dep,pod-template-hash=7674c564c
my-dep-7674c564c-svzgv   1/1     Running   0          4s      app=my-dep,pod-template-hash=7674c564c
my-dep-7674c564c-v4mnc   1/1     Running   0          6m39s   app=my-dep,pod-template-hash=7674c564c

DaemonSet

StatefulSet

Pods

Usually a group of containers, volume declarations

Smallest app building block in k8s, replicated across nodes to achieve the app’s desired availability, scalability, performance, capacity requirements.

Smallest unit of compute that can be deployed.

A Pod is similar to a set of containers with shared namespaces and shared filesystem volumes

Offers similar isolation as [[Containers]] using cgroups, namespaces etc.

Execute a command in a container contained in the pod - kubectl exec -it <podname> -c <container-name> -- /bin/sh

There is no networking between containers within a pod. All containers in a pod share the same network namespace and IP, and can reach each other over localhost.

Run a single standalone pod, change its default image command, and check the output using kubectl logs: kubectl run busybox --image busybox --command -- nslookup kubernetes

kubectl run busybox --image busybox --command -- sleep 3600 kubectl exec busybox -it -- nslookup kubernetes

Double dash -- separates the kubectl command from the command you want to run in the container. Use -n namespace immediately after kubectl to avoid passing this argument to the container command instead.

#ask can you do this in a single step? run a pod/container, and get the output on command line?
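One way to do it in a single step (a sketch): run the pod attached to your terminal and have it deleted when the command exits, so the output lands directly on the command line.

kubectl run busybox --image=busybox --rm -it --restart=Never -- nslookup kubernetes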

Labels

Add identifying information to an object. This information can then be used to query and select objects. Labels help add information to objects that is relevant to users, so are useful in UI or CLI.

Labels allow users to map their own org structure on system resources. Things like environment, team etc.

A label key and value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores, up to 63 characters each.

Optionally, the key can begin with a DNS subdomain prefix and a single ‘/’, like example.com/my-app.

It appears under the metadata field

apiVersion:
kind:
metadata:
  labels:
    app: myawesomeapp

List labels applied to a pod. By default, show-labels=false kubectl get pod/nginx --show-labels

Apply label to a pod kubectl label pod/podid newlabel=value

Remove an existing label from a pod kubectl label pod/podid newlabel- Note the trailing -

Update an existing label kubectl label pod/podid oldlabel=newvalue --overwrite without --overwrite flag label is not updated

If --overwrite is true, then existing labels can be overwritten, otherwise attempting to overwrite a label will result in an error.

Inspect labels applied to all objects kubectl get all --all-namespaces --show-labels

Use Selector flag to list only resources with a specific label kubectl get all --selector app=my-dep

Some labels are applied automatically, example on a [[#Namespace]], kubernetes.io/metadata.name=namespacename

If --resource-version is specified, then updates will use this resource version, otherwise the existing resource-version will be used. This resource-version available under metadata.resourceVersion.

Selector

Appears under spec

apiVersion:
kind:
metadata:
spec:
  selector:
    matchLabels:
      app: myawesomeapp

Use Selector flag to list only resources with a specific label kubectl get all --selector app=my-dep
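The --selector/-l flag also accepts set-based expressions (label names are illustrative):

kubectl get pods -l app=my-dep                          # equality based
kubectl get pods -l 'environment in (production, qa)'   # set based
kubectl get pods -l 'app,!canary'                       # has label app, does not have label canary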

Annotations

Add non-identifying information/metadata to objects. Annotations cannot be used to query and select objects. Information or metadata added as annotation to objects is mostly for use by machines ex - iam role annotations in case of IRSA

Deployment versions are added as annotations to the metadata field in the manifest yml

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
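Annotations can be managed from the CLI much like labels (key/values are illustrative):

kubectl annotate pod/podid owner='team-a'              # add an annotation
kubectl annotate pod/podid owner='team-b' --overwrite  # update it
kubectl annotate pod/podid owner-                      # remove it (note the trailing -)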

Labels vs Annotations

  • Labels = identifying information, Annotations = non-identifying information
  • Labels can be used to select objects or collection of objects, annotations cannot be used to identify or select objects
  • Annotations can contain characters not allowed by labels
Property                  Labels   Annotations   Notes
Identifying information   yes      no
Limited characters        yes      no
Use with selector         yes      no
User friendly             yes      no

Namespace

Some objects are namespace scoped while others are cluster wide

Objects can have same name across namespaces, but must be unique within a namespace

Provides isolation for resources

All namespaces

Use the --all-namespaces (-A) and -n flags to work with all namespaces, or a specific namespace:

kubectl [verb] [resource] --all-namespaces
kubectl [verb] [resource] -A
kubectl [verb] [resource] -n namespace
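Creating a namespace and making it the default for the current context (namespace name is illustrative):

kubectl create namespace dev
kubectl config set-context --current --namespace=dev
kubectl config view --minify | grep namespace   # verify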

Existing namespaces

kubectl get ns on a fresh cluster will show these 4 existing namespaces

default           Active   48m
kube-node-lease   Active   48m
kube-public       Active   48m
kube-system       Active   48m

Namespace Issue

When creating a [[#Service]], a corresponding DNS entry like service.namespace.svc.cluster.local is created. Due to this, all namespace names must be valid DNS names.

To connect to a service in the same namespace, just specifying service is enough. It will be resolved locally within the same namespace. This is useful to launch multiple environments with the same config without much modifications.

To connect to a service in a different namespace, fully qualified name service.othernamespace.svc.cluster.local must be used.

[!danger] Be careful about namespaces matching public domain names.

Suppose, a namespace is named com, it contains a service called google. The local DNS name for it will be google.com.svc.cluster.local. If another service, foo in the same namespace tries to reach the public google.com, it will get resolved to the local google service instead.

Restrict permissions to create namespaces, and use admission controllers to further enforce this.

[!Test] Launch a service called landing in ai namespace, are other services in that space able to reach the public landing.ai service?

I wasn’t able to reproduce this behaviour :(

Update: I don’t see this happening with the busybox image, BUT this can be seen with the dnsutils image. All properties are exactly the same between both pods, so it might be down to the OS used in each image 🤷

$ kubectl -n ai get svc
NAME      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
landing   ClusterIP   10.97.252.177   <none>        80/TCP    21m

# this still resolves to the public landing.ai service
$ kubectl exec busybox -it -- nslookup landing.ai
Server:         10.96.0.10
Address:        10.96.0.10:53

Non-authoritative answer:
Name:   landing.ai
Address: 35.196.113.152

Non-authoritative answer:

# this resolves to the private landing.ai service
$ kubectl exec busybox -it -- nslookup landing.ai.svc.cluster.local
Server:         10.96.0.10
Address:        10.96.0.10:53

Name:   landing.ai.svc.cluster.local
Address: 10.97.252.177

$ kubectl exec -it busybox -- cat  /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5

$ kubectl exec -it dnsutils -- nslookup launch.ai
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   launch.ai.svc.cluster.local
Address: 10.106.142.53

$ kubectl exec -it dnsutils -- nslookup launch
Server:         10.96.0.10
Address:        10.96.0.10#53

** server can't find launch: NXDOMAIN

$ kubectl exec -it dnsutils -- cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5

Service

Almost like a virtual load balancer, connected to [[#Deployments]] using [[#Labels]]

Properties - IP address, target port, endpoints, session affinity?

It connects to the nodes which run kube-proxy. kube-proxy uses iptables to connect to the pods running on the nodes.

The service object ensures that traffic is redirected to one of the pods.

kubectl get svc -A shows all services running in a cluster

kubectl expose creates a service by looking up a deployment, replica set, replication controller, pod or another service by name and using the selector of the resource.

kubectl expose deployment nginx --port=80 --target-port=8000

Port vs Target Port? targetPort is the port on the pod that the service targets; port is the port that the service exposes.
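A minimal Service manifest equivalent to the kubectl expose example above (a sketch; the selector assumes the deployment's pods carry the label app: nginx):

apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: ClusterIP        # default
  selector:
    app: nginx           # matches pod labels, not the deployment name
  ports:
    - port: 80           # port the service exposes
      targetPort: 8000   # port on the pod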

Cluster IP

default, internal access only

NodePort

ties a port on the node to a port of the pod, making it accessible from outside the cluster

LoadBalancer

Public cloud load balancers

ExternalName

uses DNS names, redirection happens at DNS level

Service without selector

use for direct connections based on IP/port; since there is no selector, the Endpoints object is not created automatically and must be defined manually. Useful for databases and within namespaces
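A sketch of a selector-less Service pointing at an external database, with the Endpoints object created by hand (name, IP and port are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: external-db
spec:
  ports:
    - port: 5432
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-db   # must match the Service name
subsets:
  - addresses:
      - ip: 10.0.0.50
    ports:
      - port: 5432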

#ask Can I not use a service for resources with no labels?

#ask What is a headless service?

Ingress

Successor [[GatewayApi]]

Provides a http route from outside the cluster to services running in the cluster. It can also handle ssl termination, load balancing and name based virtual hosting.

Example ingress sending all traffic to a single service

graph LR;
client([client])-. Ingress-managed <br> load balancer .->ingress[Ingress];
ingress-->|routing rule|service[Service];
subgraph cluster
ingress;
service-->pod1[Pod];
service-->pod2[Pod];
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class ingress,service,pod1,pod2 k8s;
class client plain;
class cluster cluster;

Example ingress config

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx-example
  rules:
  - http:
      paths:
      - path: /testpath
        pathType: Prefix
        backend:
          service:
            name: test
            port:
              number: 80

Ingress spec has rules which are matched against all incoming http requests, and the traffic is directed accordingly.

Ingress Annotations are often used to configure certain properties depending on the ingress controller in use.

If no host is specified in rules as in the example above, it matches all hosts.

Backend can also be a resource, but you cannot specify both resource and service for a path. resource backend is useful for directing requests for static assets to an object storage.

pathType can be one of Prefix, Exact, or ImplementationSpecific (up to the IngressClass)

For exposing arbitrary protocols and ports, [[NodePort]] or LoadBalancer service type can be used.

An ingress resource on its own doesn’t mean anything, it needs an [[Ingress Controller]] to be present on the cluster to provide the required functionality.

For handling TLS, the ingress spec should refer to a secret which provides the cert and secret key. For TLS to work properly, the host values in spec.tls.hosts must match spec.rules.host.
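A sketch of the TLS portion of an Ingress spec (host and secret name are illustrative; the secret must be of type kubernetes.io/tls containing tls.crt and tls.key):

spec:
  tls:
    - hosts:
        - myapp.example.com     # must match the host in spec.rules
      secretName: myapp-tls     # secret holding the cert and key
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: test
                port:
                  number: 80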

#find how is the ingress configured in the general eks cluster?

Blog post

Ingress Controller

Various options like nginx, aws alb, istio etc.

Each ingress controller implements a particular ingress class. For ex, for aws load balancer controller, it is alb. (ref)

Networking

Official docs, Design doc

A node contains pods, which are controlled by a deployment; each pod has an IP.

Service is connected to deployment using label

IP is a pod property, not a container property; kubectl describe pod shows the IP assigned to a pod, or use kubectl get pods -o wide

4 major problems -

  1. container to container communication - handled by the [[#Pods|pod]], localhost communication
  2. pod to pod communication - explained below
  3. pod to service communication - handled by [[#Service|services]]
  4. external to service communication - handled by [[#Service|services]]

Plugin

When changing a network plugin - ensure the network cidr stays the same

DNS

# service
NAMESPACE     NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-system   kube-dns     ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   33s
#pods
NAMESPACE      NAME                                   READY   STATUS    RESTARTS   AGE
kube-system    coredns-5dd5756b68-2l28z               1/1     Running   0          33s
kube-system    coredns-5dd5756b68-t55kw               1/1     Running   0          33

What objects get dns names?

  1. [[#Service]] service.namespace.svc.cluster.local
  2. [[#Pods]] pod-ipv4.namespace.pod.cluster.local

Each pod has a dns policy defined under pod.spec.dnsPolicy, value is either of

  • Default, inherits from the node
  • ClusterFirst, any query not matching cluster domain is forwarded to upstream DNS servers.
  • ClusterFirstWithHostNet, for pods running with hostNetwork: true.
  • None, specify dns configs under pod.spec.dnsConfig

Note: Default is NOT the default DNS policy. If no policy is specified, ClusterFirst is used.
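A sketch of a pod overriding DNS completely with dnsPolicy: None (nameserver, search and option values are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: custom-dns
spec:
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
      - 10.96.0.10
    searches:
      - default.svc.cluster.local
    options:
      - name: ndots
        value: "5"
  containers:
    - name: busybox
      image: busybox
      command: ["sleep", "3600"]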

Do an nslookup on kubernetes

pod/busybox created
$ kubectl logs pod/busybox
Server:         10.96.0.10
Address:        10.96.0.10:53

** server can't find kubernetes.cluster.local: NXDOMAIN
** server can't find kubernetes.cluster.local: NXDOMAIN

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1

** server can't find kubernetes.svc.cluster.local: NXDOMAIN
** server can't find kubernetes.svc.cluster.local: NXDOMAIN

This provides the ip of the kubernetes service which can be verified using kubectl describe svc/kubernetes

Note: lookup only works within the namespace. Outside the namespace, you won’t get the result!

$ kubectl run dnsnginx --image busybox --command -- nslookup nginx
pod/dnsnginx created
$ kubectl run dnskube --image busybox --command -- nslookup kube-dns
pod/dnskube created
$ 
$ kubectl logs pod/dnsnginx
Server:         10.96.0.10
Address:        10.96.0.10:53

** server can't find nginx.cluster.local: NXDOMAIN
** server can't find nginx.cluster.local: NXDOMAIN
** server can't find nginx.default.svc.cluster.local: NXDOMAIN
** server can't find nginx.default.svc.cluster.local: NXDOMAIN

$ kubectl -n nginx run dnsnginx --image busybox --command -- nslookup nginx 
pod/dnsnginx created
root@controlplane:~$ kubectl logs pod/dnsnginx -n nginx
Server:         10.96.0.10
Address:        10.96.0.10:53

Name:   nginx.nginx.svc.cluster.local
Address: 10.99.230.232

** server can't find nginx.cluster.local: NXDOMAIN
** server can't find nginx.cluster.local: NXDOMAIN

Why is that? Check the dns config inserted into a pod -

$ kubectl -n nginx exec -it dnsnginx -- /bin/sh
/ # cat /etc/resolv.conf 
search nginx.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5

The name server 10.96.0.10 points to the kube-dns service running in the kube-system namespace

$ kubectl get svc -A
NAMESPACE     NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
default       kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP                  52m
kube-system   kube-dns     ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP   52m
nginx         nginx        ClusterIP   10.99.230.232   <none>        80/TCP                   43m

Q: How to connect to a service running in namespace B if it can't be queried from pods in namespace A?
A: The service can be queried using the format servicename.namespace from any namespace in the cluster.

https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/

Storage

kubectl explain pod.spec.volumes shows the different volume types that are available for use.

Volumes

Volumes can be ephemeral or persistent.

To use a volume within a pod’s containers, you need to specify spec.volumes and spec.containers[*].volumeMounts. The container so created sees the data contained in the image + any data mounted as a volume.

Specified for a pod in spec.volumes, to check all the available configuration options, use kubectl explain pod.spec.volumes

Volume types used to be cloud specific; these have now been deprecated in favor of 3rd party [[#storage drivers]] instead. The following volume types are still valid -

  • Secret (always mounted as RO, don’t use as subpath to receive updates)
  • ConfigMap (always mounted as RO, don’t use as subpath to receive updates)
  • Local, Empty Dir, Host Path relate to local filesystems of the node.
  • PVC
  • Projected
  • Downward API - check coredns pods
graph LR;
subgraph pod
 subgraph container1
  m1[volMount]
 end
 subgraph container2
  m2[volMount]
 end
 subgraph volumes
  v[vol]
 end
end
subgraph storage
 pv[pv]
end
subgraph claim
 pvc[pvc]
end

pv --bound--> pvc
v --> m1
v --> m2
pvc --> v

PV Persistent Volumes decouple the storage requirements from pod development. A PV uses properties like accessModes, capacity, mountOptions, persistentVolumeReclaimPolicy, volumeMode etc. to mount the persistent volume to the pod.

PV can be created manually (manifest) or dynamically (using a storage class)

Access Modes can be one of the following

  • ReadWriteOnce (RWO) - A single node can mount this volume as read write. Many pods on this node can still use the volume.
  • ReadOnlyMany (ROX) - Many pods can mount the volume as read only.
  • ReadWriteMany (RWX) - Many pods can mount the volume as read, write.
  • ReadWriteOncePod (RWOP) - A single pod can mount the volume as read, write (version v1.22 onwards only).

PVC Persistent volume claims are used by pod authors to add storage needs in a declarative way, without worrying about storage specifics.

PVC use properties like accessModes, volumeMode, storageClassName, resources, selector to provision the storage as per the requirements. kubectl explain pvc.spec to know about all the properties.

Simple local example

# pv.yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-vol # not used anywhere
  labels:
    type: local
spec:
  accessModes:
   - ReadWriteOnce
  capacity:
    storage: 2Gi
  hostPath:
    path: "/data" # this should exist on host
# pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pv-claim # used in pod.spec.volumes[].pvc.claimName
spec:
  accessModes:
   - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi # <= pv.spec.capacity.storage
# pod.yaml
kind: Pod
apiVersion: v1
metadata:
  name: pv-pod
spec:
  containers:
    - name: pv-container
      image: nginx
      ports:
        - containerPort: 80
          name: nginxhttp
      volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: cvol # from pod.spec.volumes[].name
  volumes:
    - name: cvol
      persistentVolumeClaim:
        claimName: pv-claim # from pvc.metadata.name
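Applying the three manifests and checking that the claim gets bound (a sketch):

kubectl apply -f pv.yaml -f pvc.yaml -f pod.yaml
kubectl get pv,pvc              # STATUS should show Bound for both
kubectl describe pvc pv-claim   # Events show binding/provisioning problems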

#ask what happens if pvc.spec.requests.storage > pv.spec.capacity.storage ?

Storage Class

  • can be grouped according to anything - capacity, type, location etc.
  • Uses spec.provisioner to connect to the storage
  • When a PVC does not specify a storageClassName, the default StorageClass is used.
  • The cluster can only have one default StorageClass. If more than one default StorageClass is set, the newest default is used.

Example

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters: # provisioner specific parameters
  type: gp2
reclaimPolicy: Retain
allowVolumeExpansion: true
mountOptions:
  - debug
volumeBindingMode: Immediate
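The default StorageClass is selected via an annotation; a sketch of checking and setting it (class name taken from the example above):

kubectl get storageclass        # the default is marked with "(default)"
kubectl annotate storageclass standard storageclass.kubernetes.io/is-default-class=true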

ConfigMap

Decouple configuration from application

example, notice it uses data instead of the usual spec

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginxcm
data:
  # use in `pod.spec.volumes[].configMap.items[].key`
  nginx-custom-config.conf: |
    server {
      listen 8080;
      server_name localhost;
      location / {
        root /usr/share/nginx/html;
        index index.html index.htm;
      }
    }

Use it in a pod

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - name: conf
          mountPath: /etc/nginx/conf.d/
  volumes:
    - name: conf
      configMap:
        name: nginxcm
        items:
          # key as in configMap.data.key
          - key: nginx-custom-config.conf
            # path within the container
            path: default.conf
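The same ConfigMap could also be created from a file, and the mounted result verified inside the pod (a sketch; the file name matches the key above):

kubectl create configmap nginxcm --from-file=nginx-custom-config.conf
kubectl describe configmap nginxcm
kubectl exec nginx -- cat /etc/nginx/conf.d/default.conf   # should show the mounted config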

Secrets

Decouple sensitive variables from application

example - notice it uses data instead of the usual spec

apiVersion: v1
kind: Secret
metadata:
  name: secret
data:
  username: encodedusername # values under data are base64 encoded
  password: encodedpassword
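A sketch of creating the secret imperatively (kubectl does the base64 encoding for you):

kubectl create secret generic secret --from-literal=username=admin --from-literal=password='S3cret!'

And consuming a key as an environment variable inside a container spec (names are illustrative):

  containers:
    - name: app
      image: nginx
      env:
        - name: DB_USERNAME
          valueFrom:
            secretKeyRef:
              name: secret
              key: username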

Kubernetes API

Collection of [[RESTful APIs]]; supports the standard HTTP verbs (GET, POST, PUT, PATCH, DELETE). It is crucial to identify the API version to use.

#ask why did kubernetes project choose this RESTful API approach?

To allow the system to continuously evolve and grow.

New features can be easily added without impacting existing clients as alpha, and moved to beta, then stable version as they mature.

It also allows the project to maintain compatibility with existing clients by offering both beta and stable version of an API simultaneously (for a length of time).

Versioning is done at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-life and/or experimental APIs.

The API server handles the conversion between API versions transparently: all the different versions are actually representations of the same persisted data. The API server may serve the same underlying data through multiple API versions.

So, if I create a resource using an API version v1beta1, I can later use v1 version to query or manage it (within the deprecation period). Some fields may need updating due to the API graduating to v1, but, I can still migrate to the newer version of the API without having to destroy and recreate the resource.

API versions cannot be removed in future versions until this issue is fixed.

API access is controlled by the API server.

It saves the serialized objects in [[etcd]].

API resources are distinguished by their API group, resource type, namespace (for namespaced resources), and name.

Monitor deprecated API requests - apiserver_requested_deprecated_apis metric. This can help identify if there are objects in the cluster still using deprecated APIs.

#ask kube-proxy, where is it hosted, how does it work?

graph LR
subgraph server
api[api/etcd]
end

cr[curl] --> kp[kube-proxy] --> api

List available resource APIs, their kind, group, version, whether they are namespaced (bool), any shortnames etc. with kubectl api-resources -o wide

API groups can be enabled or disabled using --runtime-config flag on API server

$ kubectl api-resources
NAME                              SHORTNAMES   APIVERSION                             NAMESPACED   KIND
..
configmaps                        cm           v1                                     true         ConfigMap
...
namespaces                        ns           v1                                     false        Namespace
nodes                             no           v1                                     false        Node
persistentvolumeclaims            pvc          v1                                     true         PersistentVolumeClaim
persistentvolumes                 pv           v1                                     false        PersistentVolume
pods                              po           v1                                     true         Pod
...
secrets                                        v1                                     true         Secret
serviceaccounts                   sa           v1                                     true         ServiceAccount
services                          svc          v1                                     true         Service
...
networking.k8s.io/v1                   false        IngressClass
ingresses                         ing          networking.k8s.io/v1                   true         Ingress
networkpolicies                   netpol       
...

List versions of available API kubectl api-versions

$ kubectl api-versions
..
apps/v1
authentication.k8s.io/v1
authorization.k8s.io/v1
autoscaling/v1
autoscaling/v2
..
flowcontrol.apiserver.k8s.io/v1beta2
flowcontrol.apiserver.k8s.io/v1beta3
networking.k8s.io/v1
node.k8s.io/v1
policy/v1
rbac.authorization.k8s.io/v1
scheduling.k8s.io/v1
storage.k8s.io/v1
v1

Find properties required for an object kubectl explain <object.property>

Find namespace scoped APIs or cluster wide APIs kubectl api-resources --namespaced=true kubectl api-resources --namespaced=false

Proxy kubectl to access the API more easily using [[Curl]]

#ask But why would you do this? If an app needs to interact with kubernetes, it can simply use the language specific http library to do this directly instead of going through kubectl

kubectl proxy --port=8080

$ curl http://localhost:8080/version
{
  "major": "1",
  "minor": "28",
  "gitVersion": "v1.28.2",
  "gitCommit": "89a4ea3e1e4ddd7f7572286090359983e0387b2f",
  "gitTreeState": "clean",
  "buildDate": "2023-09-13T09:29:07Z",
  "goVersion": "go1.20.8",
  "compiler": "gc",
  "platform": "linux/amd64"
}

Get pods from kube-system namespace (truncated output)

$ curl http://localhost:8080/api/v1/namespaces/kube-system/pods | less
{
  "kind": "PodList",
  "apiVersion": "v1",
  "metadata": {
    "resourceVersion": "1555"
        "name": "coredns-5dd5756b68-fcz42",
        "generateName": "coredns-5dd5756b68-",
        "namespace": "kube-system",
        "uid": "326ba1b7-31b6-4d6c-9978-1057f6734154",
        "resourceVersion": "553",
..

Check the openapi v3 specification (truncated output) on /openapi/v3, and v2 specification on /openapi/v2

$ curl http://localhost:8080/openapi/v3
{
"paths": {
  ".well-known/openid-configuration": {
    "serverRelativeURL": "/openapi/v3/.well-known/openid-configuration?hash=4488--"
  },
  "api": {
    "serverRelativeURL": "/openapi/v3/api?hash=929E--"
  },
  "api/v1": {
    "serverRelativeURL": "/openapi/v3/api/v1?hash=5133--"
  },
  "apis": {
    "serverRelativeURL": "/openapi/v3/apis?hash=27E0--"
  },
  "apis/admissionregistration.k8s.io": {
    "serverRelativeURL": "/openapi/v3/apis/admissionregistration.k8s.io?hash=E8D5.."
  }
}

API Extensions

Custom Resources

This is a way to make the API server recognize new non-standard Kubernetes objects.

Example - Prometheus Operator uses a number of CRDs to manage the deployment in a cluster.
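A minimal CustomResourceDefinition sketch (group and kind are illustrative) that makes the API server accept a new object type:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.example.com     # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
    shortNames: [ct]
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                cronSpec: {type: string}
                replicas: {type: integer}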

Aggregation Layer

Needs to be enabled and then runs in-process in the kube-apiserver.

You first need to create an APIService object, say myawesomeapi, at a path, say apis/myawesomeapi/v1beta1/. The aggregation layer then proxies any requests API server receives for this API to the registered APIService.

Example - metrics server

Create a Cluster Manually!

Kubernetes releases before v1.24 included a direct integration with Docker Engine, using a component named dockershim.

What is dockershim?

To provide support for multiple container runtimes, the CRI API/specification was developed. But since Docker was the first container runtime k8s supported, and to maintain backward compatibility, dockershim was developed, which allowed kubelet to interact with the Docker runtime via the CRI API, sort of like a proxy?

graph LR;
kb[kubelet] <--cri--> ds[dockershim] <--> dc[docker] <--> cd[containerd] --> c1[container 1]
cd --> c2[container 2]
cd --> cn[container n]
graph LR;
kb[kubelet] <--cri--> ccd[cri-containerd] <--> cd[containerd] --> c1[container 1]
cd --> c2[container 2]
cd --> cn[container n]

Create a 3 node cluster - 1 control node, and 2 worker nodes.

  1. On control node, kubeadm init
  2. On control node, Networking
  3. On worker node, kubeadm join

Control node

  1. Install a container runtime like docker, containerd, cri-o.
  2. Install kube tools like kubeadm, kubelet, kubectl
  3. kubeadm init Ref
  4. Setup $HOME/.kube/config
  5. Verify all hosts are present under /etc/hosts
  6. Install a pod network add-on - any one of calico, cilium, flannel etc.

Worker nodes

  1. Install a container runtime like docker, containerd, cri-o.
  2. Install kube tools like kubeadm, kubelet, kubectl
  3. kubeadm join <control-plane-endpoint> --token xx --discovery-token-ca-cert-hash xx

To create high availability - use 3 controller nodes, each running with etcd, or use a dedicated etcd cluster

Older, needs refining

Pod Disruption Budgets https://kubernetes.io/docs/tasks/run-application/configure-pdb/ Based on the value of maxUnavailable for specific pods, cluster autoscaler will either ignore a node, or scale it down.
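A minimal PodDisruptionBudget sketch (name, label and values are illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2        # or use maxUnavailable
  selector:
    matchLabels:
      app: my-app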

Pod Affinity https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/ uses pod labels

example

apiVersion: v1
kind: Pod
metadata:
  name: label-demo
  labels:
    environment: production
    app: nginx
spec: . . .

Pods Anti Affinity https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity

labels allow us to use selectors

labels are also useful to slice/dice resources when using kubectl

kubectl get pods -Lapp -Ltier -Lrole

-L displays an extra column in kubectl output; -l either selects by label or updates the label applied to a resource.

Selectors can be of 2 types

Equality based (accelerator=nvidia-tesla-p100)

apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  containers:
    - name: cuda-test
      image: "registry.k8s.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100

Set based

apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  containers:
    - name: cuda-test
      image: "registry.k8s.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  affinity: # set based requirements need matchExpressions; nodeSelector only supports equality
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:   # expressions within a term are ANDed
              - key: environment
                operator: In
                values: [qa, qa1]
              - key: accelerator
                operator: In
                values: [nvidia, intel]

service, replicationcontroller format for selector

selector:
  component: redis

daemonset, replicaset, deployment, job format for selector

selector:
  matchLabels:
    component: redis
  matchExpressions:
    - {key: component, operator: In, values: [redis]}

Labels

Standard or default

kubernetes.io/arch
kubernetes.io/hostname # cloud provider specific
kubernetes.io/os
node.kubernetes.io/instance-type # if available to kubelet
topology.kubernetes.io/region    #
topology.kubernetes.io/zone      # 

Labels and selectors

[[K8S Scheduling]]

Troubleshooting

Pod Error Alerts

sum (kube_pod_container_status_waiting_reason{reason=~"CrashLoopBackOff|ImagePullBackOff|ErrImagePull.+"}) by (namespace, container, reason)

Questions

1. What is the value of kubernetes.io/hostname in [[eks]]?

I know it’s part of the standard labels #todo (link it), but I haven’t really seen this tag on [[eks]]. Found the following tags instead 🤷

  • kubernetes.io/cluster/myawesomecluster=owned
  • aws:eks:cluster-name=myawesomecluster
2. How do pods communicate with each other in a cluster?
3. How will you control which pod runs on which node(s)

Mix of scheduling options like node selector, affinity/anti-affinity, taints, tolerations etc.