In-House Operations Platform
How ayedo Guided a System Integrator from Evolved Operations to an Auditable Operations Platform

Operating your own data center was long considered a competitive advantage—especially for system integrators who not only develop but also manage customer applications. Mastering operations allows for guaranteed availability, controlled updates, data sovereignty, and confident responses to compliance questions.

In practice, however, this advantage often flips exactly when it becomes most crucial: when the portfolio grows, demands increase, and customers expect not just “it works” but measurable SLAs, traceable operational processes, and auditable evidence.

In this post, we demonstrate through an anonymized customer project how ayedo supported a mid-sized IT system integrator in building their own Kubernetes-based operations platform in their data center—not as a Managed Service, but as enablement: with workshops, architectural and implementation guidance, modular automation via Polycrate blocks, and a 24/7 support safety net for critical situations.

The customer remains anonymous. The solution—and the approach behind it—is reproducible.


Initial Situation: When “Evolved” Becomes Operational Debt

The customer is a system integrator with around 90 employees, its own data center, and a portfolio of over 40 managed customer applications. The infrastructure had been sensibly expanded over the years: VMware for virtualization, Ansible for provisioning, Nagios for monitoring, supplemented by scripts and cron jobs.

This initially sounds like “standard.” And it was—until the reality of growth exposed the weaknesses of this model.

The central problem was not a single tool. It was the lack of platform logic. Ansible playbooks existed but were individually maintained and inconsistent. Thus, automation was not reproducible but person-dependent. If Admin A wrote a playbook, it worked on their workstation, with their variables, with their assumptions. Admin B faced a slightly different reality—and suddenly automation became a source of errors.

A similar pattern emerged in operations. Day-2 tasks like backups, updates, certificate renewals, log rotation, or scaling were run manually or semi-automatically via cron jobs. The tasks were not systematically monitored. Many problems were not “discovered” but “reported”—mostly by customers. Expired certificates, full disks, versions with known security vulnerabilities: These are not primarily technical problems. They are process problems.

The monitoring setup was symptomatic of the deeper problem. Nagios provided binary states: service reachable or not. What was missing were trend data and operational metrics. How do response times develop? Is there creeping performance degradation? Where are bottlenecks forming? How can capacity be planned? Without metrics, logs, and correlatable signals, operations remain reactive.

And then came the point where “reactive” was no longer acceptable: SLA evidence.

Several customers demanded monthly availability reports. These were compiled manually from logs: time-consuming, error-prone, and, crucially, not a reliable basis for management decisions. Whether SLAs were met was more estimation than measurement.

In parallel, compliance questions increased, especially with regard to NIS-2: access controls, backup strategies, and evidence documentation. Measures were partially in place, but not auditable. What is not documented, versioned, and exportable does not exist in an audit.


The Trigger: “Traceable Operations Platform or Contract Risk”

The turning point came when a strategically important customer made it clear: The operations contract will only be renewed if a modern, traceable operations platform can be demonstrated within a year—including automated monitoring, SLA tracking, and documented incident response processes.

For many companies, this is the moment when outsourcing or public cloud appears as a “quick solution.” For this system integrator, that was not an option. Operations were a core competence and part of the value proposition. Customers expected control over infrastructure—and that was to be preserved.

The question was not “How do we outsource operations?” but: How do we build operations so that they are scalable, standardized, measurable, and auditable?

This is where we at ayedo came in.


ayedo’s Approach: Enablement + Platform Engineering Instead of Tool Fireworks

In such projects, success is not determined by the toolset but by the sequence: knowledge, architecture, standardization, automation, operational security.

Kubernetes played a central role, not as a trend but as a declarative operations model that enables platform logic: desired state, reproducible deployments, standardized workloads, clear interfaces between teams.

At the same time, it was clear: The team had no Kubernetes know-how. Introducing on-premises Kubernetes without building competence creates a platform that looks modern but is not operationally mastered. We wanted to avoid exactly that.

Therefore, our approach consisted of five phases that build on each other.


Phase 1: Building Know-How in Production—Not in the Sandbox

We started with a structured series of workshops explicitly designed for productive operations. The important thing was not just the curriculum but the environment: The workshops took place on-site and worked directly on the real infrastructure, networks, and servers that would later run productively.

The team did not need to “understand Kubernetes” abstractly, but had to learn how to install, operate, secure, and make a cluster observable—in their own data center, with their own conditions.

A focus was on typical on-premises pitfalls: cluster topology, high availability of the control plane, network design (CNI) in the existing setup, storage integration (CSI) for persistent workloads, and GitOps as the operational standard instead of “kubectl on demand.”

This phase ended not with a certificate but with a production-ready cluster that the team had built themselves. That is the difference between training and enablement.


Phase 2: Designing the Target Architecture Together—Suitable for Hardware, Customers, and Skill Level

Technical depth does not arise from “more components” but from appropriate decisions.

In the architectural planning, we developed a target architecture together with the customer that considers the existing hardware but also introduces clear principles:

Declarative instead of imperative, Git as the single source of truth, standardized observability, and a backup/DR concept that not only exists but is regularly tested.

Especially in the on-premises context, it was important to us that the platform does not depend on individuals. This means: clear cluster topology, traceable network and storage decisions, standard paths for deployments, and a platform “definition” that is versioned.


Phase 3: Building the Platform Modularly—With Polycrate Blocks Instead of Patchwork Integration

Many teams lose months because they evaluate every Helm chart themselves, test every Grafana version themselves, and solve every integration edge themselves. In the end, an individual stack is created that is only maintainable with great effort.

Our approach was modular: The platform was built from Polycrate blocks—versioned, tested infrastructure components from the PolyHub, which we continuously maintain and update.

The observability stack was based on VictoriaMetrics, VictoriaLogs, and Grafana. This was not a “monitoring tool change,” but a shift in the operational paradigm: away from binary alarms, towards metrics, logs, and dashboards that make trends visible. This makes capacity planning, performance analysis, and anomaly detection possible in the first place.
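To make "creeping degradation" concrete: the kind of question the new stack can answer is whether a latency series trends upward over weeks. A minimal, generic sketch of such a trend check (plain least-squares over hypothetical data, not any specific VictoriaMetrics or Grafana API):

```python
def trend_slope(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (time, value) pairs.

    A sustained positive slope on e.g. p95 response times indicates
    creeping performance degradation long before a binary
    reachable/unreachable check would ever fire.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

# Hypothetical daily p95 latencies in ms, rising by 2.5 ms per day
latencies = [(day, 120 + 2.5 * day) for day in range(14)]
print(trend_slope(latencies))  # → 2.5
```

A real deployment would run such analyses inside the monitoring stack over stored metrics; the point is that trend questions become answerable at all once metrics exist.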

ArgoCD was introduced as a GitOps deployer. This means every change is a commit, every change is traceable, every deployment is reproducible. No SSH on servers, no “fixes” at night that no one documents.

For backup and disaster recovery, Velero was integrated—including configurable retention and automatable restore tests. This is crucial: Backups are worthless if restores are never rehearsed.

cert-manager automated TLS certificate management. This sounds trivial but is a huge operational lever: Certificates do not “accidentally” expire but are renewed as part of the platform logic.

Authentik was introduced as central identity management to consistently control access to internal tools and platform components via SSO—a key component for auditability and access evidence.

Importantly, Polycrate blocks bring not only installation but also operational standardization. Updates are pulled via polycrate pull and rolled out via ArgoCD. Custom adjustments remain intact because Polycrate supports inheritance. This massively reduces maintenance effort without sacrificing flexibility.


Phase 4: Day-2 Automation and SLA Tracking with Polycrate API

The greatest quality gain in operations is not achieved once "the cluster runs" but through Day-2 automation: monitoring integration, discovery, health checks, compliance evidence, SLA reporting.

This is where Polycrate API came into play.

The platform automatically recognizes newly deployed applications, ingresses, certificates, and backup jobs. New customer applications are included in the monitoring without manual additional configuration—a crucial difference from classic toolchains, where monitoring always has to be “followed up.”
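The mechanism can be pictured as a controller that watches Kubernetes objects and derives monitoring targets from them. A simplified sketch (the field paths follow the standard Ingress schema; the enrollment logic is illustrative, not Polycrate API's actual code):

```python
def discover_targets(ingresses: list[dict]) -> list[dict]:
    """Derive HTTP check targets from Ingress objects.

    Illustrative only: a discovery loop like this is how newly
    deployed customer applications end up in monitoring without
    any manual additional configuration.
    """
    targets = []
    for ing in ingresses:
        spec = ing.get("spec", {})
        # Hosts covered by a TLS section get certificate checks too
        tls_hosts = {
            host
            for tls in spec.get("tls", [])
            for host in tls.get("hosts", [])
        }
        for rule in spec.get("rules", []):
            host = rule.get("host")
            if host:
                targets.append({"host": host, "tls": host in tls_hosts})
    return targets

example = {
    "spec": {
        "tls": [{"hosts": ["app.example.com"]}],
        "rules": [{"host": "app.example.com"}],
    }
}
print(discover_targets([example]))
# → [{'host': 'app.example.com', 'tls': True}]
```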

Endpoints are continuously checked, including TLS configuration and certificate lifetimes. This gives the team not only “uptime” but a real service signal.
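Checking certificate lifetimes ultimately comes down to comparing the certificate's notAfter timestamp against the current time. A minimal sketch of that date arithmetic (the timestamp format matches what Python's ssl module returns from getpeercert(); the 14-day threshold is an illustrative assumption):

```python
from datetime import datetime, timezone

def cert_days_remaining(not_after: str, now: datetime) -> float:
    """Days until a certificate expires.

    `not_after` uses the format returned by ssl getpeercert(),
    e.g. "Jun 01 12:00:00 2030 GMT".
    """
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - now).total_seconds() / 86400

now = datetime(2030, 5, 25, 12, 0, tzinfo=timezone.utc)
days = cert_days_remaining("Jun 01 12:00:00 2030 GMT", now)
if days < 14:  # illustrative alert threshold
    print(f"renew soon: {days:.0f} days left")  # prints "renew soon: 7 days left"
```

In the platform this runs continuously against live endpoints; the sketch only shows the lifetime calculation that turns "certificate exists" into an actionable signal.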

The core, however, was the SLO/SLA management. For each application, internal SLOs (e.g., 99.95%/30 days) and contractual SLAs (e.g., 99.9%/365 days) were defined. Polycrate API measures actual availability, calculates error budgets, and warns automatically before a breach occurs.

This is important: SLA management is not reporting. It is control. Error budgets make availability plannable and allow prioritization in operations.

Downtime tracking also became systemic: planned maintenance windows can be marked as not SLA-relevant; unplanned outages are cleanly recorded. This makes reports not only “nice” but reliable.
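The arithmetic behind error budgets and maintenance-aware availability is simple but worth making explicit. A generic sketch using the SLO figures mentioned above (this is the standard calculation, not Polycrate API's implementation):

```python
def error_budget_seconds(slo: float, window_days: int) -> float:
    """Total allowed unplanned downtime for an SLO over a window."""
    return (1.0 - slo) * window_days * 86400

def availability(window_days: int, outages: list[tuple[float, bool]]) -> float:
    """Availability over a window.

    `outages` is a list of (duration_seconds, planned) pairs.
    Planned maintenance windows are excluded from the
    SLA-relevant time base; only unplanned downtime counts.
    """
    window_s = window_days * 86400
    planned = sum(d for d, is_planned in outages if is_planned)
    unplanned = sum(d for d, is_planned in outages if not is_planned)
    return 1.0 - unplanned / (window_s - planned)

# 99.95% over 30 days leaves roughly 21.6 minutes of error budget
print(round(error_budget_seconds(0.9995, 30) / 60, 1))  # → 21.6

# 40 min unplanned outage plus a 2 h planned maintenance window
print(availability(30, [(2400, False), (7200, True)]))
```

Warning "before a breach occurs" then means alerting when the consumed share of the error budget crosses a threshold, rather than after the SLA is already missed.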

Additionally, regular reconciliation checks ran: backup status, certificate lifetimes, resource utilization, replication lag, pending updates. This shifts operations from “we react” to “we prevent.”
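Such reconciliation checks are essentially threshold rules evaluated against observed state. A hypothetical sketch (check names, thresholds, and suggested actions are invented for illustration):

```python
# Each check: observed key -> (is_ok predicate, suggested preventive action)
CHECKS = {
    "cert_days_remaining": (lambda v: v > 14, "renew certificate"),
    "disk_used_pct": (lambda v: v < 80, "expand volume or clean up"),
    "last_backup_age_hours": (lambda v: v < 26, "investigate backup job"),
    "pending_security_updates": (lambda v: v == 0, "schedule update window"),
}

def reconcile(observed: dict) -> list[str]:
    """Return a preventive action for every check that is out of bounds."""
    return [
        f"{key}: {action}"
        for key, (is_ok, action) in CHECKS.items()
        if key in observed and not is_ok(observed[key])
    ]

state = {"cert_days_remaining": 9, "disk_used_pct": 62, "last_backup_age_hours": 40}
print(reconcile(state))
# → ['cert_days_remaining: renew certificate', 'last_backup_age_hours: investigate backup job']
```

Run on a schedule, a loop like this is what turns "we react" into "we prevent": findings surface as tickets or alerts before a customer notices anything.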

And finally: audit logs. All actions on the platform are logged—who deployed what when, who changed accesses, when a backup was performed. This is the basis for not only claiming compliance but proving it.


Phase 5: 24/7 Priority Support as a Safety Net—Without Dependency

A typical dilemma with on-premises Kubernetes: You want independence, but you don’t want to be alone at three in the morning when the control plane wobbles or storage behaves strangely.

Therefore, ayedo Priority Support was deliberately positioned as a safety net, not as a replacement for the team. In the event of an incident, an escalation level is ready, with defined response times and 24/7 availability.
