
7 Surprising Facts About the World of DevOps

Even after years in the field, some things continue to surprise even the most experienced professionals.


1. DevOps is Moving Beyond Programming


Not everyone knows that 10% of organizations are already aiming for a NoOps model: near-autonomous operations in which most tasks are handled by bots, automation pipelines, and intelligent systems, with no need for daily human intervention.


2. YAML Has Become a Major Source of Failures

A single indentation error in a Kubernetes, GitHub Actions, or ArgoCD file can bring down an entire environment or silently change its behavior. In practice, more failures are caused by YAML structure than by the values inside it.
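As a sketch of how little it takes (the names below are illustrative, not from a real manifest): in the first snippet, nodeSelector is a pod-level field; shift it two spaces and it becomes an unknown field on the container. Both are valid YAML, but in the second case the API server, depending on its validation settings, either rejects the manifest or silently drops the field, and the pod schedules onto any node.

    # Intended: nodeSelector is a pod-level field.
    spec:
      containers:
      - name: web              # illustrative name
        image: nginx:1.25
      nodeSelector:
        disktype: ssd          # pod is pinned to SSD nodes

    # Two extra spaces: nodeSelector is now an unknown field of the
    # container. Still valid YAML; depending on validation it is
    # rejected or silently dropped, and the pod schedules anywhere.
    spec:
      containers:
      - name: web
        image: nginx:1.25
        nodeSelector:
          disktype: ssd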


3. Most Kubernetes Failures Aren't Actually About Kubernetes


80% of issues stem from container images, incorrect resource configuration, misconfigured liveness and readiness probes, or CNI failures, not from Kubernetes itself.
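A minimal sketch of the configuration these failures trace back to (the service name, image, port, and paths are assumptions): pinned image tags, explicit resource requests, and probes that match how the application actually behaves.

    spec:
      containers:
      - name: api                            # hypothetical service
        image: registry.example.com/api:1.4.2  # pinned tag, not :latest
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            memory: "512Mi"
        readinessProbe:          # gates traffic; a wrong path means no traffic
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:           # restarts the container; too aggressive means restart storms
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20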


4. "Terraform Apply" Can Be a Critical Risk Factor


In large organizations managing many components, one small change can wipe out IAM roles, DNS records, clusters, or other critical infrastructure. That's why mature teams run every change through Policy-as-Code and a GitOps process, with review via PRs, branches, and other Git mechanisms, before anything is applied.

The rule: Treat Terraform as code → GitOps workflow.
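A minimal sketch of that workflow, assuming GitHub Actions and Terraform code under an infra/ directory (both assumptions): every PR gets a validated plan, and apply happens only after review and merge, in a separate job not shown here.

    name: terraform-plan                 # hypothetical workflow
    on:
      pull_request:
        paths:
          - "infra/**"                   # assumption: Terraform lives here
    jobs:
      plan:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: hashicorp/setup-terraform@v3
          - name: Validate and plan (never apply on a PR)
            working-directory: infra
            run: |
              terraform init -input=false
              terraform validate
              terraform plan -input=false

A Policy-as-Code gate (OPA/Conftest, Sentinel, and similar tools) would slot in as one more step that inspects the plan output before a human ever approves it.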


5. GitOps is Not a Deploy Tool — It's a Consistency Engine


ArgoCD and Flux don't "deploy" code; they continuously reconcile the cluster against Git and correct drift. Any manual change? GitOps will automatically revert it.
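Here is a sketch of what that looks like in an ArgoCD Application (the app name and repo URL are placeholders): with selfHeal and prune enabled, anything that diverges from Git is reverted or removed.

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: payments                     # hypothetical app
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/example/deploy-configs  # placeholder repo
        targetRevision: main
        path: payments
      destination:
        server: https://kubernetes.default.svc
        namespace: payments
      syncPolicy:
        automated:
          prune: true      # resources removed from Git are deleted
          selfHeal: true   # manual changes in the cluster are reverted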


6. 90% of Outages Start with Small Configuration Mistakes


One incorrect CIDR (a /16 instead of a /24), a missing ALB health check, or a CPU limit set without a request definition, and production goes down.
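The limit-without-request case deserves a sketch, because the failure mode is subtle: when only a limit is set, Kubernetes defaults the request to the same value, so the scheduler reserves the full limit and pods can suddenly become unschedulable.

    # Pitfall: limit only. The CPU request silently defaults to the
    # limit, so the scheduler reserves two full cores for this pod.
    resources:
      limits:
        cpu: "2"

    # Explicit: a modest request for scheduling, the limit as a ceiling.
    resources:
      requests:
        cpu: "250m"
      limits:
        cpu: "2"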


7. Container Restart Policies Hide Real Failures


A Pod with "restartPolicy: Always" can appear "healthy" while actually cycling through CrashLoopBackOff, leaking memory, failing at container startup, or running with unattached volumes.

Remember: Auto-healing ≠ healthy.
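One way to surface what auto-healing hides is to alert on restart counts. A minimal sketch, assuming kube-state-metrics is installed and Prometheus alerting is in place; the threshold is arbitrary:

    groups:
    - name: pod-health                   # illustrative group name
      rules:
      - alert: PodRestartingFrequently
        # Several restarts within an hour usually means a crash loop
        # hiding behind restartPolicy: Always.
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} keeps restarting"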


Bonus Fact: Observability is Mostly About Cardinality, Not Metrics


Instead of reporting metrics like CPU load, disk usage, or traffic volume, we want to be able to say: "Only some users in region A see 500s from service B when the request includes header X and passes through Pod Y on Node Z… but only after a deploy."

Instead of counting metrics, we want a picture that can answer questions we didn't know to ask.

Most Prometheus crashes are caused by incorrect use of labels and unbounded metric dimensions. If not properly maintained, Prometheus brings down more clusters than it "saves".
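Cardinality is controlled at ingestion. A minimal sketch of one guardrail, assuming a standard Prometheus scrape config and hypothetical label names: drop unbounded labels before every unique value becomes its own time series.

    scrape_configs:
    - job_name: api                      # hypothetical job
      static_configs:
      - targets: ["api:8080"]
      metric_relabel_configs:
      # user_id and request_id are examples of unbounded labels: each
      # unique value would create a new time series.
      - regex: "user_id|request_id"
        action: labeldrop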


