The End of Person-Dependent Automation

In the history of mid-sized IT infrastructures and system houses, running one’s own data center was for decades considered an undeniable competitive advantage. Those who control the hardware retain data sovereignty, manage update cycles independently, and can address compliance issues flexibly. To manage the growing number of servers and customer applications, administrators adopted automation tools early on: VMware for virtualization, Ansible for provisioning, and custom shell scripts or cron jobs for recurring Day-2 tasks.

However, these organically grown structures hit an invisible but relentless boundary as the portfolio grows and customer demands increase. What starts as sensible, pragmatic automation gradually becomes operational debt. The risk rarely lies with the tools themselves but with a fundamental conceptual flaw: the lack of a consistent system and platform logic.

The Script Dilemma: When Automation Becomes the Source of Errors

When IT infrastructures grow without a standardized platform architecture, control over operational processes fragments as well. In practice, three critical weaknesses emerge in organically grown setups:

1. The Person-Dependence of Imperative Automation

Ansible playbooks or Bash scripts exist, but they are often maintained by individuals and not versioned centrally. The automation is therefore imperative and tied to a person. A script written by Administrator A works on their workstation, with their specific environment variables and under their implicit assumptions. When Administrator B runs it against a slightly different state of the data center, execution fails, and the automation itself becomes an unpredictable risk factor.

2. The Black-Box Behavior of Day-2 Tasks

Maintenance tasks like backups, log rotation, TLS certificate renewals, or capacity scaling run in isolation as local cron jobs on the individual VMs. Nothing monitors these tasks centrally. The result: problems such as expired certificates or full disks are not detected by the system but are reported by the customer after the failure has already occurred. Operations remain in a permanently reactive mode.

3. The Incomplete Audit Trail for Compliance and SLAs

Who made which change to a customer application, and when? With scattered scripts and manual ad-hoc interventions on servers via SSH, there is no consistent, tamper-proof record. Under strict EU security rules such as the NIS-2 directive or the DORA regulation, however, whatever is not documented, versioned, and exportable effectively does not exist in an audit. The risk of contractual penalties or liability issues increases with every manual adjustment.

The Declarative Turn: Kubernetes and GitOps as Platform Standard

To break the vicious cycle of person-dependent administration, a radical departure from the imperative principle (“Do step A, then step B”) is required. The modern solution lies in establishing a declarative operating model, realized through the combination of Kubernetes and GitOps.

This paradigm fundamentally shifts the logic of infrastructure management:

 [ Git Repository: Single Source of Truth ]
     (Desired State as declarative code)
                      |
                      v  (Automatic Pull / Webhook)
       [ GitOps Controller (Argo CD) ] <====================+
                      |                                     |
                      |  (Continuous Reconciliation)        |  (Current State)
                      v                                     |
          [ Kubernetes API Server ]                         |
                      |                                     |
                      v                                     |
          [ Central Resource Pool ] ========================+
     (Managed Nodes, Networks, Storage)

1. Describing the Desired State Instead of Command Chains

In the declarative model, developers and platform engineers describe only the desired target state (Desired State) of an application or infrastructure component in standardized YAML files (e.g., via Helm or Kustomize). These files specify how many instances must run, which storage is attached, and which environment variables apply. How this state is reached is decided autonomously by the platform.
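As a minimal sketch of what such a desired-state description contains, the following Python snippet builds a Kubernetes Deployment manifest as a plain dictionary and serializes it to JSON, which the Kubernetes API accepts alongside YAML. The application name `customer-portal`, the image reference, the `LOG_LEVEL` variable, and the volume claim name are hypothetical placeholders chosen for illustration, not values from a real setup.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int) -> dict:
    """Return a Deployment that states WHAT should run, not HOW to get there."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,  # how many instances must run
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        # which environment variables apply (hypothetical)
                        "env": [{"name": "LOG_LEVEL", "value": "info"}],
                        "volumeMounts": [
                            {"name": "data", "mountPath": "/var/lib/app"}
                        ],
                    }],
                    # which storage is attached (hypothetical claim name)
                    "volumes": [{
                        "name": "data",
                        "persistentVolumeClaim": {"claimName": f"{name}-data"},
                    }],
                },
            },
        },
    }


manifest = deployment_manifest(
    "customer-portal", "registry.example.com/customer-portal:1.4.2", 3
)
print(json.dumps(manifest, indent=2))
```

Note that the description contains no commands at all: changing the replica count means changing one number, and the platform works out the steps.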

2. Git as the Single Source of Truth

All configuration files reside in a central, version-controlled Git repository. A GitOps controller (like Argo CD) within the cluster continuously monitors this repository. Every change to the system, whether an app update, scaling, or a configuration adjustment, must be made as a commit or pull request in Git. Manual “quick fixes” via SSH directly on the servers are no longer part of the workflow.
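To illustrate how a controller is pointed at the repository, here is a hedged sketch of an Argo CD `Application` resource, again built as a Python dictionary and printed as JSON. The repository URL, path, application name, and target namespace are hypothetical; the field names follow the Argo CD `Application` resource as commonly documented, but should be checked against the installed version.

```python
import json


def argocd_application(name: str, repo_url: str, path: str, namespace: str) -> dict:
    """Describe which Git path Argo CD should keep in sync with which namespace."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumption: Argo CD is installed in the "argocd" namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,       # the single source of truth
                "targetRevision": "main",  # branch, tag, or commit to track
                "path": path,              # directory with the manifests
            },
            "destination": {
                "server": "https://kubernetes.default.svc",  # the local cluster
                "namespace": namespace,
            },
            "syncPolicy": {
                "automated": {"prune": True, "selfHeal": True},
            },
        },
    }


app = argocd_application(
    name="customer-portal",
    repo_url="https://git.example.com/platform/deployments.git",
    path="customers/acme/customer-portal",
    namespace="acme",
)
print(json.dumps(app, indent=2))
```

With `selfHeal` enabled, Argo CD reverts manual changes in the cluster back to the state in Git, and `prune` removes resources that were deleted from the repository.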

3. Automatic Reconciliation Loop

The platform continuously compares the desired state defined in the Git repository with the actual current state in the data center. If the two diverge, for example because a service crashes or a configuration is changed manually, the platform intervenes to correct it. It autonomously restores the state defined in the code, typically within seconds to a few minutes depending on the configured reconciliation interval (Self-Healing).
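The comparison step can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how Kubernetes or Argo CD is actually implemented: both states are plain dictionaries keyed by resource name, and the names `web`, `worker`, and `legacy-cron` are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str  # "create", "update" or "delete"
    name: str


def plan(desired: dict, actual: dict) -> list[Action]:
    """Compare desired and actual state and list the corrections needed."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(Action("create", name))
        elif actual[name] != spec:
            actions.append(Action("update", name))
    for name in actual:
        if name not in desired:
            actions.append(Action("delete", name))
    return actions


def reconcile(desired: dict, actual: dict) -> dict:
    """Apply the planned corrections to a copy of the actual state."""
    converged = dict(actual)
    for action in plan(desired, actual):
        if action.kind == "delete":
            del converged[action.name]
        else:
            converged[action.name] = desired[action.name]
    return converged


desired = {"web": {"replicas": 3}, "worker": {"replicas": 2}}
# Drift: "web" was scaled down by hand, "worker" crashed, a stray job exists.
actual = {"web": {"replicas": 1}, "legacy-cron": {"replicas": 1}}

actions = plan(desired, actual)
healed = reconcile(desired, actual)
```

A real controller runs this comparison in a loop. The key property is that once the states match, the plan is empty and nothing is touched.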

Strategic Value: Scalability and Auditability at the Push of a Button

The transformation from a grown infrastructure to a consistent operating platform changes the resilience and cost-effectiveness of the entire data center operation:

Fault Tolerance Through Standardization: Since all application landscapes, network boundaries, and storage assignments are modular and versioned as code, operations become independent of individuals. If an administrator is unavailable, any other team member can view, understand, and replicate the exact state of the infrastructure.
Effortless Proof in Audits: The Git history provides a time-stamped audit trail by design. Auditors can be shown exactly who made and approved which change and when security updates were rolled out; evidence that backups were successfully validated comes from the recorded results of the platform’s automated restore tests. Compliance becomes an automated byproduct of the platform architecture rather than a bureaucratic burden.
Guaranteed SLA Stability Through Day-2 Automation: Core components such as cert-manager renew TLS certificates automatically as an integral part of the platform logic. Integrated backup systems periodically and autonomously run restore tests for the emergency case. Risks are reduced before they can jeopardize the contractually agreed SLAs.
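For the audit-trail point above, the following sketch shows how the Git history could be exported into structured records. It assumes a local clone of the configuration repository and uses only standard `git log` format placeholders (`%H`, `%an`, `%aI`, `%s`, `%x1f`). The `AuditEntry` type and both function names are hypothetical helpers. Note that Git itself records authors and timestamps; who approved a change usually comes from the pull-request records of the Git hosting platform.

```python
import subprocess
from dataclasses import dataclass

FIELD_SEP = "\x1f"  # ASCII unit separator, unlikely to occur in commit subjects
GIT_FORMAT = "%H%x1f%an%x1f%aI%x1f%s"  # hash, author, ISO 8601 date, subject


@dataclass(frozen=True)
class AuditEntry:
    commit: str
    author: str
    timestamp: str
    summary: str


def parse_git_log(output: str) -> list[AuditEntry]:
    """Turn the raw `git log` output into one AuditEntry per commit."""
    entries = []
    for line in output.split("\n"):
        if not line.strip():
            continue
        commit, author, timestamp, summary = line.split(FIELD_SEP, 3)
        entries.append(AuditEntry(commit, author, timestamp, summary))
    return entries


def read_audit_trail(repo_path: str, path: str = ".") -> list[AuditEntry]:
    """Read the change history of `path` from the repository at `repo_path`."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{GIT_FORMAT}", "--", path],
        check=True, capture_output=True, text=True,
    )
    return parse_git_log(result.stdout)


# Illustrative input in the same format, so the parser can be tried without a repo.
sample = (
    "a1b2c3\x1fAlice\x1f2024-05-01T10:00:00+02:00\x1fScale web to 3 replicas\n"
    "d4e5f6\x1fBob\x1f2024-04-28T16:30:00+02:00\x1fRotate ingress certificate issuer\n"
)
entries = parse_git_log(sample)
```

Calling `read_audit_trail("/path/to/clone", "customers/acme")` would restrict the export to the changes affecting one customer, which is often what an auditor asks for.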

Conclusion: Platform Logic Beats Tool Chaos

Scaling a modern portfolio of customer applications cannot be solved by simply stringing together more automation tools. Those who fight complexity with individual scripts reap operational instability. True digital sovereignty in one’s own data center only arises when operations are transformed from a person-dependent service into a standardized, measurable, and auditable operating platform. Only this architectural step preserves full control over the infrastructure while freeing the engineering team for tomorrow’s innovations.

FAQ: The Path to a Declarative Platform

Do we have to completely discard our existing Ansible structures?

No. The transition to declarative platform logic is an evolutionary process. During the transition phase, Ansible remains well suited to provisioning the underlying operating systems (on bare metal or VMs) and the basic network configuration on which the Kubernetes cluster runs. The management, scaling, and securing of the actual customer applications and their Day-2 services, however, are handed over consistently to the declarative level of Kubernetes and GitOps.

How do we prevent sensitive passwords from ending up in the Git repository?

This is a core question in any consistent GitOps approach. Plaintext secrets must never be checked into the Git repository, so the platform is extended with a specialized operator (like the External Secrets Operator). Only declarative reference manifests remain in the Git repository. In the cluster, the operator resolves these references and retrieves the actual passwords over an authenticated connection from a central secrets store (like OpenBao or a Key Vault), where they are kept encrypted at rest, for example with AES-256.
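What such a reference manifest looks like can be sketched as follows, again as a Python dictionary serialized to JSON. The store name `openbao-backend`, the remote key `customers/acme/db`, and the resource name are hypothetical; the field names follow the External Secrets Operator's `ExternalSecret` resource, and the `apiVersion` is an assumption that depends on the installed operator version.

```python
import json


def external_secret(name: str, store: str, remote_key: str, prop: str) -> dict:
    """Describe WHERE a secret lives, without containing the secret itself."""
    return {
        # Assumption: older operator releases use v1beta1, newer ones may use v1.
        "apiVersion": "external-secrets.io/v1beta1",
        "kind": "ExternalSecret",
        "metadata": {"name": name},
        "spec": {
            "refreshInterval": "1h",  # how often the operator re-reads the store
            "secretStoreRef": {"name": store, "kind": "ClusterSecretStore"},
            "target": {"name": name},  # Kubernetes Secret the operator creates
            "data": [{
                "secretKey": "password",  # key inside the resulting Secret
                "remoteRef": {"key": remote_key, "property": prop},
            }],
        },
    }


placeholder = external_secret(
    name="customer-portal-db",
    store="openbao-backend",
    remote_key="customers/acme/db",
    prop="password",
)
print(json.dumps(placeholder, indent=2))
```

Everything in this manifest is safe to commit: it names the store and the key, while the value is only ever materialized inside the cluster.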

How does the system react if the Git repository is temporarily unavailable?

The platform operates autonomously. Should the central Git repository be temporarily unavailable due to a network disruption, all active customer applications and Day-2 processes in the cluster continue to run undisturbed in the last known desired state. During this phase, only the rollout of new software releases and structural configuration changes is blocked until the connection to the source of truth is restored.
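The behaviour described here can be modelled with a small cache that keeps the last successfully fetched desired state. This is an illustrative sketch only: the class `DesiredStateCache` and the `fake_fetch` function are hypothetical and do not reflect the internals of any specific GitOps controller.

```python
from typing import Callable, Optional


class DesiredStateCache:
    """Keeps the last successfully fetched desired state as a fallback."""

    def __init__(self, fetch: Callable[[], dict]):
        self._fetch = fetch
        self.last_known: Optional[dict] = None
        self.source_reachable = False

    def current(self) -> Optional[dict]:
        """Return fresh state if Git is reachable, else the last known state."""
        try:
            self.last_known = self._fetch()
            self.source_reachable = True
        except ConnectionError:
            # Git is unreachable: keep reconciling against the cached state.
            self.source_reachable = False
        return self.last_known


# Simulated repository: the first fetch succeeds, the second hits an outage.
responses = iter([{"web": {"replicas": 3}}, ConnectionError("git unreachable")])


def fake_fetch() -> dict:
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


cache = DesiredStateCache(fake_fetch)
first = cache.current()   # fetched from the repository
second = cache.current()  # outage: served from the cache
```

During the outage the cached state still drives self-healing, while new commits simply cannot arrive until the repository is reachable again.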