Setting up a highly available (HA) Kubernetes cluster requires a load-balanced endpoint for the Kubernetes API server. This guide demonstrates how to configure an HA Cilium load balancer for the Control Plane provisioned by ClusterAPI.

General idea

The original idea comes from a similar MetalLB feature request.

Reconfigure kubelets on apiserver nodes to talk only to 127.0.0.1. This way, MetalLB can schedule on the apiserver nodes, and it can set up the cluster LB. From there, all other kubelets can connect to the LB IP, and everything works. One implication of this is that kubelets on the apiserver machines will be “less reliable”, because they will drop out of the cluster if their local apiserver is unavailable, even if the apiserver LB IP is still working. This is probably OK, because if the apiserver is unhealthy, the machine is probably pretty broken anyway… But it’s still forcing users to change the availability semantics of their cluster.

I’d argue that failure scenarios for this implementation are negligible and unlikely to have real consequences. The local API server is the stateless pod with the highest priority. Chances are that if the API server fails, nothing else on the node will work very well. The likely scenario is that the Cilium pod on that node would start failing the health check because it cannot connect to the API server. Therefore, if the failed node advertises the Control Plane’s load balancer IP at the moment of failure, the IP will fail over to one of the other working Control Plane nodes.

High level solution step-by-step

  1. First Control Plane node gets provisioned
    1. Node boots up
    2. Hardcode a DNS entry for the Control Plane host to point to 127.0.0.1. This makes the cluster functional so that we can deploy the Cilium CNI later.
    3. kubeadm initializes the cluster
  2. Configure Cilium CNI
    1. Deploy Cilium CNI
    2. Configure LoadBalancer service
    3. Advertise the LoadBalancer IP only from Control Plane nodes; worker nodes would not be able to route packets to the API server.
  3. Other Control Plane nodes get provisioned
    1. Node boots up
    2. kubeadm runs with join configuration
    3. Hardcode a DNS entry for the Control Plane host to point to 127.0.0.1 so that the node keeps working independently of the first node.
  4. Worker nodes join the cluster. These nodes won’t have any hardcoded DNS entries and will connect to the API server through the load-balanced endpoint.

Implementation

Let’s assume our desired IP address is 10.10.10.10.
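
The rest of this guide assumes the cluster’s control plane endpoint is an FQDN that resolves to that IP. As a sketch only (the cluster name is hypothetical, and with some infrastructure providers this field is populated for you rather than set by hand), the ClusterAPI Cluster resource would carry it like this:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: example                 # hypothetical cluster name
spec:
  controlPlaneEndpoint:
    host: 10.10.10.10.nip.io    # FQDN that resolves to the desired load balancer IP
    port: 6443
  # controlPlaneRef / infrastructureRef omitted for brevity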

We first need to create a KubeadmControlPlane with the following patches.

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    initConfiguration:
      skipPhases:
        - addon/kube-proxy
    preKubeadmCommands:
      - |
        set -x
        INIT_CONFIG='/run/kubeadm/kubeadm.yaml'
        if [ -f "${INIT_CONFIG}" ]; then
          API_ENDPOINT="$(grep -oP '(?<=controlPlaneEndpoint: ).+(?=:6443)' ${INIT_CONFIG})"
          echo "127.0.0.1 $API_ENDPOINT" >> /etc/hosts
        fi        
    postKubeadmCommands:
      - |
        set -x
        JOIN_CONFIG='/run/kubeadm/kubeadm-join-config.yaml'
        if [ -f "${JOIN_CONFIG}" ]; then
          API_ENDPOINT="$(grep -oP '(?<=apiServerEndpoint: ).+(?=:6443)' ${JOIN_CONFIG})"
          echo "127.0.0.1 $API_ENDPOINT" >> /etc/hosts
        fi        

Note: what’s going on here?

  • skipPhases deploys the Kubernetes cluster without kube-proxy, which will enable us to run Cilium as a kube-proxy replacement.
  • preKubeadmCommands hardcodes the DNS entry for the Control Plane endpoint to localhost before any services start on the initial node.
  • postKubeadmCommands, on the other hand, hardcodes the same entry on every other node only after it has joined the cluster. The Load Balancer is already active at that point, so joining nodes can use it; once joined, they should access kube-apiserver through 127.0.0.1. The resulting /etc/hosts line is sketched below.
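
For illustration, assuming the Control Plane endpoint host is 10.10.10.10.nip.io (the same FQDN used in the Helm values later on), both snippets boil down to appending one line to /etc/hosts on every Control Plane node:

# Resulting /etc/hosts entry on a Control Plane node
# (the FQDN is an assumption; it is whatever host your controlPlaneEndpoint uses)
127.0.0.1 10.10.10.10.nip.io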

After the first Node is provisioned, we need to deploy a CNI in order for it to become Ready. We’ll deploy Cilium with the Helm values provided below. Since the load balancer isn’t serving the API endpoint yet at this point, we’ll need to replace server: https://10.10.10.10:6443 in KUBECONFIG with https://10.10.10.1:6443 (assuming 10.10.10.1 is the IP of our first node).
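
A minimal sketch of that kubeconfig tweak, assuming KUBECONFIG points at the workload cluster’s kubeconfig and the addresses match the assumptions above:

# Temporarily point kubectl/helm directly at the first Control Plane node,
# since the load balancer IP is not advertised yet
sed -i 's|https://10.10.10.10:6443|https://10.10.10.1:6443|' "${KUBECONFIG}"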

k8sServiceHost can be replaced with another FQDN that resolves to your load balancer IP.

kubeProxyReplacement: true
k8sServiceHost: 10.10.10.10.nip.io
k8sServicePort: 6443
l2announcements:
  enabled: true
k8sClientRateLimit:
  qps: 10
  burst: 30

Since 10.10.10.10.nip.io resolves to 127.0.0.1 on the Control Plane nodes (via the /etc/hosts entry we added earlier) and to the load balancer IP everywhere else, Cilium will connect to kube-apiserver without a problem.
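
With those values saved to a file (values.yaml is an assumed name), the deployment itself is a standard Cilium Helm install; roughly:

# Add the official Cilium chart repository and install with the values above
helm repo add cilium https://helm.cilium.io/
helm repo update
helm install cilium cilium/cilium --namespace kube-system --values values.yaml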

Next, let’s create a LoadBalancer Service for the Control Plane.

apiVersion: v1
kind: Service
metadata:
  annotations:
    io.cilium/lb-ipam-ips: 10.10.10.10
  labels:
    component: kube-apiserver
    tier: control-plane
  name: kubernetes-external
  namespace: kube-system
spec:
  selector:
    component: kube-apiserver
    tier: control-plane
  ports:
  - name: https
    port: 6443
    protocol: TCP
    targetPort: 6443
  type: LoadBalancer
---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: api-server
spec:
  blocks:
  - cidr: "10.10.10.10/32"
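
Once the Service and the IP pool exist, Cilium’s LB IPAM should assign the requested address. A quick sanity check (the EXTERNAL-IP column should eventually show 10.10.10.10):

kubectl --namespace kube-system get service kubernetes-external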

Finally, configure a CiliumL2AnnouncementPolicy to advertise the LoadBalancer IP from the Control Plane nodes. Its serviceSelector has to match the labels we set on the kubernetes-external Service above.

apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: control-plane
spec:
  serviceSelector:
    matchLabels:
      component: kube-apiserver
      tier: control-plane
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
  loadBalancerIPs: true
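
To see which Control Plane node is currently announcing the IP, you can inspect the leader-election lease Cilium creates for the Service; the lease name below is an assumption based on Cilium’s documented cilium-l2announce-<namespace>-<service> naming:

# Shows the holder of the L2 announcement lease for our Service
kubectl --namespace kube-system get lease cilium-l2announce-kube-system-kubernetes-external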

At this point, other Control Plane nodes & worker nodes will be able to join the cluster without issues! 🎉

Have questions or comments? Feel free to reach out to me via e-mail.