Building Kuberik

Luka Rumora - 4 mins read

I have been trying to make CD on Kubernetes work properly for years. Kuberik is my third attempt, and the first one I think is worth showing to anyone.

The first attempt came around the time Argo and Tekton showed up. I was writing a native pipeline engine for Kubernetes and gave both a serious try, partly to see if I should give up on my own thing and use one of theirs. I ended up not using either. Wrapping a sequence of imperative steps in a CRD does not actually make it declarative. A YAML pipeline still has the same failure modes as a Jenkinsfile. I shelved my pipeline engine.

The next time I came back, I had been running flux in production for a while. flux is the closest thing to a CD tool built on Kubernetes principles: declarative, in-cluster, composable. There were design choices I did not love, and I started building something that improved on them. The clearest example was injecting values that differ per cluster. I wanted to do this with Kustomize transformers, which at the time felt like the right approach to configuration. I had long been wary of templating, and even a small amount of it in my tools made me uncomfortable. flux already has envsubst, which I had written off for the same reason. I was wrong. Used carefully, with strict substitution and a CI check that catches anything missed, envsubst works very well. By the time I worked through this and a few similar things, I realized I was rebuilding flux slightly differently and not really adding anything. I let that one go too.
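For what it is worth, the CI check is nothing elaborate: render with envsubst, then fail the build if any ${VAR} placeholder survives in the output. Below is a minimal sketch in Go, assuming the rendered manifests land in a rendered/ directory as .yaml files; both of those details are illustrative, not anything flux prescribes.

```go
// Fail CI if any rendered manifest still contains an unsubstituted
// ${VAR} placeholder after envsubst has run.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// Matches leftover substitution variables such as ${CLUSTER_NAME}.
var placeholder = regexp.MustCompile(`\$\{[A-Za-z_][A-Za-z0-9_]*\}`)

func main() {
	failed := false
	root := "rendered" // wherever the CI job writes the envsubst output
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || !strings.HasSuffix(path, ".yaml") {
			return err
		}
		data, readErr := os.ReadFile(path)
		if readErr != nil {
			return readErr
		}
		if leftovers := placeholder.FindAll(data, -1); len(leftovers) > 0 {
			failed = true
			fmt.Printf("%s: unsubstituted variables: %s\n", path, leftovers)
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	if failed {
		os.Exit(1)
	}
}
```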

What I kept running into was that flux ends where the rollout begins. Two things in my own work made this obvious. The first was deploying custom software whose code changes constantly. The in-cluster option is the flux image updater, which writes back to your git repository on every new image. I did not want that. The cluster should probably not even have read access to the git repo, let alone write access. The alternative is a CI script that bumps the image tag in the manifest before flux ever sees it. That works, but it pushes the work into CI, where I had no way to coordinate it across environments. The second was managing infrastructure across many clusters: a pile of YAML that needs to land on every one of them. Any risky change meant pausing every cluster except one, merging, watching, then unpausing the rest one at a time. flux on its own syncs from main on every cluster as soon as a change lands. There is no built-in way to do this in stages.

The third attempt took a long time to come together. The hard part was fitting all the pieces into a single declarative flow. I started top-down, focused on multi-environment promotions, and ran into edge cases I could not solve cleanly. Eventually I let it go and tried a simpler problem: just update the image on one cluster. That is when the architecture fell into place.

Look at the manual workflow and there are really only two questions to answer: when to deploy a version, and how to decide whether the deploy succeeded. The architecture maps onto those two questions. A new version comes in from a registry. Gates decide which versions are allowed at the current moment. The rollout deploys the latest version that all gates allow, and reads the health checks to decide whether the deploy succeeded or failed. The unit of rollout is an OCI image, your application or a packaged-up bundle of manifests, and the same machinery handles both. Once this was in place, it extended naturally to multi-environment promotion and everything in between. That is Kuberik.
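To make those two questions concrete, here is roughly the decision loop, written as Go. The Gate, HealthCheck, and Deployer interfaces are hypothetical names for the pieces described above, not Kuberik's actual API; read it as a sketch of the logic rather than the implementation.

```go
package rollout

// Gate answers the first question: is this version allowed right now?
type Gate interface {
	Allows(version string) bool
}

// HealthCheck answers the second question: did the deploy succeed?
type HealthCheck interface {
	Healthy() bool
}

// Deployer applies a version to the cluster, whether that means bumping
// an image or applying a bundle of manifests.
type Deployer interface {
	Deploy(version string) error
}

// Reconcile deploys the newest version that every gate allows and reports
// whether the health checks consider the result successful. The versions
// slice is assumed to be ordered oldest to newest, as read from a registry.
func Reconcile(versions []string, gates []Gate, d Deployer, checks []HealthCheck) (deployed string, healthy bool, err error) {
	for i := len(versions) - 1; i >= 0; i-- {
		candidate := versions[i]
		allowed := true
		for _, g := range gates {
			if !g.Allows(candidate) {
				allowed = false
				break
			}
		}
		if !allowed {
			continue
		}
		if err := d.Deploy(candidate); err != nil {
			return candidate, false, err
		}
		for _, c := range checks {
			if !c.Healthy() {
				return candidate, false, nil
			}
		}
		return candidate, true, nil
	}
	// No version is currently allowed; nothing to deploy.
	return "", false, nil
}
```

The point of the sketch is that every step is a question the controller can re-ask on each reconcile, rather than a step in a pipeline that runs once.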

The full pitch and the design rationale are in a separate post on the project blog: Introducing Kuberik.

Have questions or comments? Feel free to reach out to me via e-mail.