
Many SaaS companies invest early in clean architecture: containerized workloads, Kubernetes, Infrastructure as Code, GitOps. Technically, everything seems modern. Yet, a crucial component is often missing: transparency over the actual runtime behavior of the application.
In this post, we demonstrate through an anonymized project how ayedo built a scalable Application Performance Monitoring (APM) stack for a SaaS provider with several million active users, based on VictoriaMetrics, VictoriaLogs, Grafana, and OpenTelemetry, fully integrated into the existing Kubernetes infrastructure.
The client remains anonymous. The approach is transferable—especially for SaaS companies that are growing and realizing that “looking at logs” is not an observability strategy.
The client operates a SaaS platform for digital process automation and data integration with over five million active users monthly. The engineering team is lean: about 10 people for Development and DevOps. The application runs fully containerized on Kubernetes, with deployments automated via GitOps.
Architecturally, much was implemented correctly: stateless services following the 12-Factor principles, PostgreSQL as a persistent base, Redis for caching, S3 for object storage. Infrastructure and delivery were set up cleanly.
What was missing was the systematic observation of the system.
Logs were pulled ad hoc from individual pods—often only after an error was already visible. There was no central aggregation, no structured correlation between logs, metrics, and specific user requests. Every incident began with the same question: “Where do we look first?”
With the increasing complexity of the platform and a growing number of users, typical growth problems emerged: regressions after releases, N+1 query issues, sporadic latency spikes. Technically, the causes were solvable—operationally, the analysis was too slow.
Without comprehensive observability, the focus inevitably shifts to reactive troubleshooting.
Error analyses took hours instead of minutes because logs were contextless. A single stack trace reveals little if it’s not clear which user request triggered it, which services were involved, and how CPU, memory, or database metrics behaved at that moment.
Performance issues were often only noticed once end-users felt them. For a platform with millions of users, this immediately means rising user churn, more support tickets, and a loss of trust.
At the same time, uncertainty grew within the engineering team. When changes were deployed, it was unclear what side effects they would trigger in the system. Without metrics and traces, every optimization means flying partially blind.
The problem was not the architecture. It was the lack of correlation between code, infrastructure, and real runtime behavior.
Our goal was not to “introduce a monitoring tool,” but to build a holistic observability layer that is deeply integrated into the existing Kubernetes environment and understands metrics, logs, and traces as a cohesive system.
The solution was based on a modular, scalable stack:
VictoriaMetrics as a time-series database for high-performance metric collection,
VictoriaLogs as a central log backend,
Grafana for visualization and alerting,
Grafana Tempo for distributed tracing,
OpenTelemetry as a standardized instrumentation layer.
The crucial aspect was that this stack does not exist alongside the application but becomes an integral part of the operational model.
The first step was integrating OpenTelemetry (OTel) into the application. Instead of relying solely on infrastructure metrics, the services themselves now emit metrics, logs, and traces.
Every relevant request now generates a trace chain across all involved services. Database calls, external API requests, cache accesses—everything becomes visible as part of a transaction.
This shifts the perspective from “Pod threw an error” to “User request X caused a 1.8-second latency in service Y, triggered by Z database queries.” That is a qualitative difference.
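The mechanics behind such a trace chain can be sketched in a few lines of plain Python. This is not the OpenTelemetry SDK (which handles context propagation, sampling, and export for you); the `span` helper and the in-memory `spans` list are illustrative assumptions only:

```python
import secrets
import time
from contextlib import contextmanager

# A trace groups all spans belonging to one user request.
# OpenTelemetry uses a 16-byte trace ID, hex-encoded.
def new_trace_id() -> str:
    return secrets.token_hex(16)

spans = []  # in a real setup, an exporter ships these to Tempo

@contextmanager
def span(trace_id: str, name: str):
    """Record the duration of one unit of work within a trace."""
    start = time.monotonic()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_ms": (time.monotonic() - start) * 1000,
        })

# One simulated user request crossing several layers:
trace_id = new_trace_id()
with span(trace_id, "http handler"):
    with span(trace_id, "db query"):
        time.sleep(0.01)  # stand-in for a slow query
    with span(trace_id, "cache lookup"):
        pass

# All spans share the trace ID, so a tracing backend can
# stitch them back together into one transaction view.
assert all(s["trace_id"] == trace_id for s in spans)
```

Because every span carries the same trace ID, a question like "which database queries ran inside user request X" becomes a simple lookup rather than guesswork across pod logs.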
For storing the metrics, we used VictoriaMetrics—a high-performance, resource-efficient time-series database that remains stable even with millions of time series.
Especially in rapidly growing SaaS environments, scalability in monitoring itself is an issue. Traditional Prometheus setups quickly reach their limits with high cardinality. VictoriaMetrics allows horizontal scaling and long-term storage without exponential resource requirements.
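Why high cardinality bites can be shown with simple arithmetic: each unique combination of label values is its own time series, so series counts are the product of label cardinalities. The label names and counts below are hypothetical:

```python
from math import prod

# Hypothetical labels on a single request-duration metric:
label_cardinalities = {
    "pod": 400,        # pods across all deployments
    "endpoint": 50,    # distinct API routes
    "status_code": 8,  # observed HTTP status codes
}

# Each unique label combination is a separate time series,
# so one metric can fan out into hundreds of thousands.
series = prod(label_cardinalities.values())
print(series)  # 400 * 50 * 8 = 160000 series for ONE metric
```

Add a per-user or per-tenant label and the product explodes, which is exactly the regime where a single-node Prometheus struggles and a horizontally scalable backend pays off.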
Logs are centrally aggregated via VictoriaLogs. Instead of scattered "kubectl logs" calls, there is now a consistent, searchable log stream across all services. Errors can be directly correlated with specific metrics and system states via trace IDs.
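The glue that makes logs correlatable is simply emitting the active trace ID on every log line. A stdlib-only sketch of the idea, using a `logging.Filter` to inject a trace ID from a context variable (in production, the OTel SDK manages this context; the trace ID value here is a made-up example):

```python
import json
import logging
from contextvars import ContextVar

# The current request's trace ID. The OpenTelemetry SDK
# normally propagates this for you.
current_trace_id: ContextVar[str] = ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Stamp every log record with the active trace ID."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True

class JsonFormatter(logging.Formatter):
    """Emit structured JSON lines a log backend can index."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": record.trace_id,
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
handler.addFilter(TraceIdFilter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Inside a request handler:
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.error("payment failed: upstream timeout")
# The log backend can now join this line with the matching
# trace in Tempo via the shared trace_id field.
```

With every log line carrying a trace ID, jumping from an error message to the full transaction is a join on one field instead of a manual hunt across pods.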
Grafana serves as the central interface for the entire team. Dashboards not only show CPU and memory values but also application metrics like request duration, error rates, query latencies, or cache hit rates.
With Grafana Tempo, distributed traces are visualized. A single click on an error message leads to the complete transaction chain—including all involved services.
This fundamentally changes error analysis. Instead of interpreting isolated log lines, engineers see the behavior of a request in the overall context.
A significant advancement was the correlation of infrastructure and application data.
Cluster-wide metrics like CPU load, memory usage, or network traffic are continuously captured and displayed in the context of application traces. This makes it visible whether a latency spike is caused by an inefficient query, an overloaded node, or an external API timeout.
This contextualization is the difference between monitoring and observability.
Alerting was not implemented as a simple "CPU > 80%" threshold but context-based.
When error rates rise or request latencies exceed certain SLO limits, alerts are sent directly to the team’s Slack channels—including a link to the dashboard, trace, and relevant logs.
This reduces false positives and significantly speeds up the response. Instead of collecting data first, the analysis begins immediately.
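The decision logic behind such an SLO-based alert is straightforward; the error budget and window below are assumptions chosen for illustration, not the client's actual targets:

```python
from dataclasses import dataclass

@dataclass
class Window:
    """Aggregated request outcomes over one evaluation window."""
    total_requests: int
    failed_requests: int

SLO_ERROR_BUDGET = 0.01  # hypothetical SLO: 99% of requests succeed

def should_alert(window: Window) -> bool:
    """Fire when the observed error rate exceeds the SLO budget,
    rather than on a raw infrastructure threshold like CPU > 80%."""
    if window.total_requests == 0:
        return False
    error_rate = window.failed_requests / window.total_requests
    return error_rate > SLO_ERROR_BUDGET

print(should_alert(Window(10_000, 50)))   # 0.5% errors -> False
print(should_alert(Window(10_000, 250)))  # 2.5% errors -> True
```

The point of tying alerts to user-facing error rates instead of host metrics is that a noisy CPU spike with no failed requests stays silent, while a quiet node that breaks 2.5% of requests pages immediately.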
After introducing the APM stack, the team’s operational routine changed significantly.
Error analysis times were drastically reduced. What previously took several hours—collecting logs, establishing context, testing hypotheses—can now be traced in a few minutes.
Regressions after deployments are detected early, often before they become noticeable to end-users. Performance degradations become visible through trends, not just during outages.
The platform’s stability has increased, and the user experience has measurably improved. At the same time, the team’s architectural understanding grew: bottlenecks, inefficient components, or problematic dependencies are identified based on data—not assumptions.
A side effect that is particularly valuable strategically: observability data is now actively used for product decisions. Feature usage, load profiles, performance correlations—all flow into prioritization and scaling decisions.
With a growing number of users, not only does the load increase, but so do expectations. SaaS customers do not tolerate performance uncertainty. At the same time, architectures become more complex—microservices, asynchronous processing, caching layers.
Without comprehensive observability, this complexity is unmanageable.
A scalable APM system based on VictoriaMetrics, Grafana, and OpenTelemetry provides exactly this transparency—without proprietary lock-ins and without making the monitoring infrastructure itself a problem.
If your SaaS platform is still primarily operated through log debugging in production today, it is a growth risk—even if the architecture seems modern.
ayedo supports the setup of scalable Application Performance Monitoring systems on Kubernetes—with VictoriaMetrics, VictoriaLogs, Grafana, Tempo, and OpenTelemetry, fully integrated into your existing platform.
If you want to not only know if your application is running but understand how it behaves, let’s talk about your observability strategy.
We help you realize this use case on your infrastructure: scalable, secure, and GDPR-compliant.