HA Cilium Load Balancer for Kubernetes Control Plane provisioned by ClusterAPI
Setting up a highly available (HA) Kubernetes cluster requires a load-balanced endpoint for the Kubernetes API server. This guide demonstrates how to configure an HA Cilium load balancer for the Control Plane provisioned by ClusterAPI.
General idea
The original idea comes from a similar MetalLB feature request:
Reconfigure kubelets on apiserver nodes to talk only to 127.0.0.1. This way, MetalLB can schedule on the apiserver nodes, and it can set up the cluster LB. From there, all other kubelets can connect to the LB IP, and everything works. One implication of this is that kubelets on the apiserver machines will be “less reliable”, because they will drop out of the cluster if their local apiserver is unavailable, even if the apiserver LB IP is still working. This is probably OK, because if the apiserver is unhealthy, the machine is probably pretty broken anyway… But it’s still forcing users to change the availability semantics of their cluster.
I’d argue that failure scenarios for this implementation are negligible and unlikely to have real consequences. The local API server is the stateless pod with the highest priority. Chances are that if the API server fails, nothing else on the node will work very well. The likely scenario is that the Cilium pod on that node would start failing the health check because it cannot connect to the API server. Therefore, if the failed node advertises the Control Plane’s load balancer IP at the moment of failure, the IP will fail over to one of the other working Control Plane nodes.
High-level solution, step by step
- First Control Plane node gets provisioned
  - Node boots up
  - Hardcode a DNS entry for the Control Plane host to point to 127.0.0.1. This makes the cluster functional so that we can deploy the Cilium CNI later (see the example entry right after this list)
  - kubeadm initializes the cluster
  - Configure Cilium CNI
  - Deploy Cilium CNI
  - Configure the LoadBalancer service
  - Advertise the LoadBalancer only from Control Plane nodes; worker nodes would not be able to route packets to the API server
- Other Control Plane nodes get provisioned
  - Node boots up
  - kubeadm runs with the join configuration
  - Hardcode a DNS entry for the Control Plane host to point to 127.0.0.1 so that the node continues working independently from the first node
- Worker nodes join the cluster. These nodes won’t have any hardcoded DNS entry and will connect to the API server through the load-balanced endpoint.
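For illustration, assuming the Cluster’s controlPlaneEndpoint is the 10.10.10.10.nip.io hostname used later in this guide, the hardcoded entry on every Control Plane node ends up being nothing more than:

# appended to /etc/hosts on each Control Plane node
127.0.0.1 10.10.10.10.nip.io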
Implementation
Let’s assume our desired IP address is 10.10.10.10. We first need to create a KubeadmControlPlane with the following patches.
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    initConfiguration:
      skipPhases:
        - addon/kube-proxy
    preKubeadmCommands:
      - |
        set -x
        # On the first node: pin the Control Plane endpoint host to localhost
        # before kubeadm init runs.
        INIT_CONFIG='/run/kubeadm/kubeadm.yaml'
        if [ -f "${INIT_CONFIG}" ]; then
          API_ENDPOINT="$(grep -oP '(?<=controlPlaneEndpoint: ).+(?=:6443)' ${INIT_CONFIG})"
          echo "127.0.0.1 $API_ENDPOINT" >> /etc/hosts
        fi
    postKubeadmCommands:
      - |
        set -x
        # On joining Control Plane nodes: pin the endpoint host to localhost
        # only after kubeadm join has gone through the load balancer.
        JOIN_CONFIG='/run/kubeadm/kubeadm-join-config.yaml'
        if [ -f "${JOIN_CONFIG}" ]; then
          API_ENDPOINT="$(grep -oP '(?<=apiServerEndpoint: ).+(?=:6443)' ${JOIN_CONFIG})"
          echo "127.0.0.1 $API_ENDPOINT" >> /etc/hosts
        fi
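Once the first Control Plane node has booted, you can sanity-check the patch by reusing the same grep that preKubeadmCommands runs (the /run/kubeadm/kubeadm.yaml path comes straight from the patch above):

# extract the endpoint host the same way the patch does,
# then confirm it now resolves to localhost
API_ENDPOINT="$(grep -oP '(?<=controlPlaneEndpoint: ).+(?=:6443)' /run/kubeadm/kubeadm.yaml)"
grep "${API_ENDPOINT}" /etc/hosts
getent hosts "${API_ENDPOINT}"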
Note: What’s going on here?
skipPhases deploys the Kubernetes cluster without kube-proxy, which will enable us to run Cilium as a replacement for kube-proxy. preKubeadmCommands hardcodes the DNS entry of the Control Plane endpoint to localhost before any services start on the initial node. postKubeadmCommands, on the other hand, hardcodes this entry on other nodes only after they’ve joined the cluster. The load balancer is already active at that point, so the nodes can use it to join the cluster; once they’ve joined, however, they need to access kube-apiserver through 127.0.0.1.
After the first Node is provisioned, we need to deploy a CNI in order for it to become Ready. We’ll deploy Cilium with the Helm values provided below. Since the load-balanced endpoint isn’t being served yet, we’ll need to replace server: https://10.10.10.10:6443 in KUBECONFIG with https://10.10.10.1:6443 (assuming 10.10.10.1 is the IP of our node). k8sServiceHost can be replaced with another FQDN that resolves to your load balancer IP.
kubeProxyReplacement: true
k8sServiceHost: 10.10.10.10.nip.io
k8sServicePort: 6443
l2announcements:
  enabled: true
k8sClientRateLimit:
  qps: 10
  burst: 30
Since 10.10.10.10.nip.io resolves to localhost on the node (thanks to the /etc/hosts entry we added earlier), Cilium will connect to kube-apiserver without a problem.
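With the values above saved to a file, the deployment could look roughly like the sketch below. It assumes you run this directly on the first node, that the admin kubeconfig sits at kubeadm’s standard /etc/kubernetes/admin.conf path, and that the values were saved as cilium-values.yaml; adjust names and paths to your setup.

# work on a copy of the admin kubeconfig and point it at the node itself,
# since the load-balanced endpoint is not being advertised yet
cp /etc/kubernetes/admin.conf /tmp/bootstrap.kubeconfig
sed -i 's#server: https://10.10.10.10:6443#server: https://10.10.10.1:6443#' /tmp/bootstrap.kubeconfig
export KUBECONFIG=/tmp/bootstrap.kubeconfig

# install Cilium from its official chart repository with the values shown above
helm repo add cilium https://helm.cilium.io/
helm repo update
helm install cilium cilium/cilium --namespace kube-system --values cilium-values.yaml

# optional: confirm skipPhases worked - the kube-proxy DaemonSet should not exist
kubectl --namespace kube-system get daemonset kube-proxy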
Next, let’s create a LoadBalancer service for the Control Plane.
apiVersion: v1
kind: Service
metadata:
  annotations:
    io.cilium/lb-ipam-ips: 10.10.10.10
  labels:
    component: kube-apiserver
    tier: control-plane
  name: kubernetes-external
  namespace: kube-system
spec:
  selector:
    component: kube-apiserver
    tier: control-plane
  ports:
    - name: https
      port: 6443
      protocol: TCP
      targetPort: 6443
  type: LoadBalancer
---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
name: api-server
annotations:
spec:
blocks:
- cidr: "10.10.10.10/32"
Finally, configure a CiliumL2AnnouncementPolicy to advertise the LoadBalancer IP from Control Plane nodes.
apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
name: control-plane
spec:
serviceSelector:
matchLabels:
component: apiserver
provider: kubernetes
nodeSelector:
matchExpressions:
- key: node-role.kubernetes.io/control-plane
operator: Exists
loadBalancerIPs: true
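To see which Control Plane node is currently announcing the address, you can inspect the leader-election lease that Cilium creates per announced service; the lease name below assumes Cilium’s cilium-l2announce-&lt;namespace&gt;-&lt;service&gt; naming. The health check works from any machine on the same L2 segment, since kube-apiserver exposes /healthz to anonymous clients by default.

# the HOLDER column shows the node that currently answers ARP for 10.10.10.10
kubectl --namespace kube-system get lease cilium-l2announce-kube-system-kubernetes-external

# from another machine on the segment: should return "ok" through the LB IP
curl --insecure https://10.10.10.10:6443/healthz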
At this point, other Control Plane nodes & worker nodes will be able to join the cluster without issues! 🎉
Useful links
- Commit implementing this solution in metalkast (also includes Ingress LB)
- Issue in MetalLB
- Issue in Cilium
- Issue in Kubernetes
Have questions or comments? Feel free to reach out to me via e-mail.