04/08/2026
Does your team know who owns the Kubernetes cluster at 2am?
Most US mid-market companies adopt Kubernetes for a technical reason. The cluster breaks for an operational one.
Two engineers build it. They understand it well.
Six months later they move on. The team that inherits it has no documented picture of how it was designed or where it breaks.
Every incident starts with an hour of archaeology.
Four configurations that separate stable clusters from fragile ones:
1. Namespace-per-team with enforced resource quotas -- without them, one workload consuming all cluster memory can evict other teams' services without warning
2. GitOps via ArgoCD or Flux -- every cluster change version-controlled and reversible
3. Probe discipline -- misconfigured liveness and readiness probes cause more production incidents than any other single config issue
4. PodDisruptionBudgets on every production workload -- without one, a rolling node upgrade can take down all replicas simultaneously
↳ Observability stack before the first production workload, not after the first incident
↳ Resource requests and limits on every workload before you touch HPA
↳ ECS Fargate, Azure Container Apps, or Nomad may get you further if your team has fewer than five platform engineers
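Several of the checks above (resource requests and limits, probes, PodDisruptionBudgets) can be verified mechanically before a workload ships. Here is a minimal sketch in Python, assuming manifests have already been parsed from YAML into plain dicts. The function name `audit_deployment` and the example workload are hypothetical. The sketch only detects missing settings, not misconfigured ones, and it matches PodDisruptionBudgets on `matchLabels` only, ignoring `matchExpressions` and empty selectors.

```python
from __future__ import annotations

from typing import Any


def audit_deployment(deployment: dict[str, Any], pdbs: list[dict[str, Any]]) -> list[str]:
    """Return human-readable findings for one Deployment manifest.

    Hypothetical helper: both arguments are manifests already parsed into dicts.
    """
    findings: list[str] = []
    namespace = deployment.get("metadata", {}).get("namespace", "default")
    template = deployment.get("spec", {}).get("template", {})
    pod_labels = template.get("metadata", {}).get("labels", {})
    containers = template.get("spec", {}).get("containers", [])

    for container in containers:
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        # Requests and limits should be set before autoscaling is considered.
        if not resources.get("requests"):
            findings.append(f"{name}: no resource requests")
        if not resources.get("limits"):
            findings.append(f"{name}: no resource limits")
        # Presence check only; a probe can exist and still be wrong.
        if "readinessProbe" not in container:
            findings.append(f"{name}: no readinessProbe")
        if "livenessProbe" not in container:
            findings.append(f"{name}: no livenessProbe")

    def selects(pdb: dict[str, Any]) -> bool:
        # Simplification: same namespace, and every matchLabels entry
        # must appear on the pod template.
        if pdb.get("metadata", {}).get("namespace", "default") != namespace:
            return False
        match = pdb.get("spec", {}).get("selector", {}).get("matchLabels", {})
        return bool(match) and all(pod_labels.get(k) == v for k, v in match.items())

    if not any(selects(pdb) for pdb in pdbs):
        findings.append("no PodDisruptionBudget selects this workload")
    return findings


# Illustrative workload with nothing configured (names are made up).
example = {
    "metadata": {"name": "api", "namespace": "payments"},
    "spec": {
        "template": {
            "metadata": {"labels": {"app": "api"}},
            "spec": {"containers": [{"name": "api", "image": "example/api:1.0"}]},
        }
    },
}

for finding in audit_deployment(example, pdbs=[]):
    print(finding)
```

Run in CI against rendered manifests, a check like this gives the inheriting team a written list of gaps instead of an hour of archaeology. Admission policies in the cluster can enforce the same rules; the script form is just the cheapest place to start.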
The operating model has to exist before the cluster does.
📖 We put together a full breakdown of everything above – configurations, alternatives, autoscaling, and the VMware/VKS path if your org runs on VCF. It took a while to write and we think it covers the parts most Kubernetes guides skip. Worth a read: https://rewtechnology.com/study-case/case-study/kubernetes-project-succeed-fail-before-first-deployment/