TL;DR
- Effective alerting is more than a few emails at 80% CPU: it requires clean metrics, clear severity levels, thoughtful routing, and throttling to reliably detect relevant incidents without overwhelming the team.
- Incident Response only works as a defined process: Detection → Triage → Investigation → Mitigation → Resolution → Postmortem. Each phase has different goals, actors, tools, and artifacts.
- NIS-2 and DORA require not just “any response” but traceable, documented procedures, including an early warning within 24 hours, an incident report within 72 hours, and a final report within 30 days.
- With VictoriaMetrics/Grafana for alerting, VictoriaLogs for forensics, and GitLab (Issues & Wiki) for tracking and postmortem reports, you can build an Incident Response chain that is both technically robust and auditable.
- ayedo combines a Cloud-Native platform with integrated monitoring, logging, and compliance components, allowing European organizations to pragmatically implement NIS-2 and DORA-compliant incident processes and work with a practical Incident Response playbook.
Why Structured Alerting and Incident Response Are Mandatory Today
As part of the European tech industry, we have reached a point where professional Incident Management is no longer “nice to have” but a regulatory standard.
The EU directive NIS-2 (adopted at the end of 2022, with national implementation due by October 2024) and the Digital Operational Resilience Act (DORA) for the financial sector (applicable from January 17, 2025) explicitly require:
- effective Incident Handling,
- documented response processes,
- and reporting obligations with clear deadlines.
This is not just a compliance burden. Properly setting up alerting and incident response brings:
- better availability,
- faster recovery in case of failure,
- and significantly less stress when something critical happens.
At its core, it’s about connecting three worlds:
- Monitoring & Alerting (e.g., VictoriaMetrics, Grafana)
- Incident Response Process (Detection to Postmortem)
- Compliance Requirements (NIS-2 / DORA, including reporting obligations)
Let’s look at these components in a structured way.
Alerting: From Metric to Meaningful Alert
Alerting Rules with Prometheus/VictoriaMetrics
The technical foundation consists of metrics and logs. Many organizations use Prometheus semantics – in some cases directly in VictoriaMetrics, which supports Prometheus APIs.
It’s important to separate:
- Metrics: raw observation data stream
- Dashboards: visual representation
- Alerting Rules: explicit conditions for when human intervention is needed
Best practices for alerting rules:
- Focus on user impact (e.g., error rate, latency) instead of purely internal metrics.
- Use meaningful time horizons (e.g., 5-10 minutes instead of one-minute spikes).
- Define rules based on SLO violations (Error Budget) instead of reporting every minor fluctuation.
In VictoriaMetrics or Prometheus, these rules are maintained centrally and can be visualized and tested via Grafana; a minimal example follows below.
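As a minimal sketch in Prometheus/vmalert rule syntax, assuming a standard request-duration histogram (`http_request_duration_seconds_bucket`), an assumed `job="api"` label, and a 500 ms threshold, an SLO-oriented rule could look like this:

```yaml
groups:
  - name: api-slo
    rules:
      - alert: APILatencySLOBreach
        # p95 request latency above 500 ms for 10 minutes.
        # Metric name, job label, and threshold are assumptions; adapt them to your services.
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="api"}[10m])) by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API p95 latency above 500 ms for 10 minutes"
```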
Alert Severity: Clear Severity Levels Instead of Gut Feeling
Without consistent severity levels, incident management quickly becomes political: “Why is my issue only medium?” This can be avoided by setting a few clearly defined levels and backing them with criteria:
- Info: Observations without immediate action required (e.g., successful deployment).
- Warning: Potential problem requiring intervention in the medium term (e.g., capacity >70%).
- Major: Users are affected, but workarounds or degradation are possible (e.g., increased error rate in a subservice).
- Critical: Business or security-critical functions are severely impacted; immediate action required.
These severity levels should be harmonized with NIS-2/DORA requirements: typically, only Major and Critical are reportable. The criteria for when an alert becomes a reportable “incident” belong in your governance documentation, not in a decision the on-call engineer has to improvise in the middle of the night.
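A lightweight way to keep these criteria close to the technology is a small severity matrix maintained alongside the alerting rules. The structure and field names below are our own convention (not a tool-specific format) and would still need to be mapped to your routing:

```yaml
# Severity policy kept next to the alerting rules (own convention, not a tool format).
# "reportable: review" means: Security/Compliance decides whether NIS-2/DORA applies.
severities:
  info:
    page_oncall: false
    create_ticket: false
    reportable: never
  warning:
    page_oncall: false
    create_ticket: true
    reportable: never
  major:
    page_oncall: true
    create_ticket: true
    reportable: review
  critical:
    page_oncall: true
    create_ticket: true
    reportable: review
```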
Alert Routing: The Right Alert to the Right Person at the Right Time
Good alerting is targeted. Typical routing principles:
- By Service/Domain: Application teams, database team, platform team.
- By Severity: Critical → immediate on-call paging; Warning → asynchronously into the ticket queue.
- By Business Relevance: Alerts potentially relevant to NIS-2/DORA are additionally sent to Security/Compliance.
In practice, Alertmanager (in the Prometheus/VictoriaMetrics ecosystem) or Grafana Alerting controls:
- where an alert arrives (pager, chat, email, ticket system),
- who is on call for which time window.
The assignment should be configuration-driven and documented. This allows you to demonstrate to auditors that there is a regulated process, not just “someone” responding by chance.
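A routing sketch in Alertmanager configuration syntax; the receiver names, webhook URLs, and the `reportable` label are assumptions chosen to illustrate the three principles above:

```yaml
route:
  receiver: default-ticket-queue
  group_by: ['alertname', 'cluster', 'service']
  routes:
    # Critical alerts page the on-call rotation immediately.
    - matchers:
        - 'severity="critical"'
      receiver: oncall-pager
      continue: true
    # Alerts flagged as potentially NIS-2/DORA relevant also go to Security/Compliance.
    - matchers:
        - 'reportable="potentially-nis2"'
      receiver: security-compliance
      continue: true
    # Warnings land asynchronously in the team's ticket queue.
    - matchers:
        - 'severity="warning"'
      receiver: team-ticket-queue

receivers:
  - name: default-ticket-queue
    webhook_configs:
      - url: 'https://ticket.example.org/hooks/alertmanager'      # assumed endpoint
  - name: oncall-pager
    webhook_configs:
      - url: 'https://pager.example.org/hooks/alertmanager'       # assumed endpoint
  - name: security-compliance
    webhook_configs:
      - url: 'https://compliance.example.org/hooks/alertmanager'  # assumed endpoint
  - name: team-ticket-queue
    webhook_configs:
      - url: 'https://chat.example.org/hooks/team-queue'          # assumed endpoint
```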
Alert Throttling: Filtering Out Noise, Retaining Signals
Serious disruptions often generate dozens of similar alerts. Without throttling and grouping, this leads to:
- alert fatigue,
- reduced responsiveness,
- and chaos in the postmortem.
Mechanisms you should consistently use:
- Deduplication: Identical alerts within a time window are consolidated.
- Grouping: Related alerts (e.g., multiple services in the same cluster) are grouped into one incident.
- Rate Limiting: Upper limits are defined per channel and recipient.
The art is to keep every service visible without drowning people in low-level noise. Your alert policy should, for example, specify: “A cluster-wide ‘High Error Rate’ alert replaces the individual alerts for all affected services.”
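In the Alertmanager, deduplication, grouping, and suppression map onto grouping parameters and inhibit rules. The excerpt below extends the routing configuration sketched above; the alert names in the inhibit rule are assumptions:

```yaml
route:
  # Related alerts from the same cluster and service are bundled into one notification.
  group_by: ['cluster', 'service', 'alertname']
  group_wait: 30s        # wait briefly so related alerts land in the same group
  group_interval: 5m     # minimum pause between notifications for an existing group
  repeat_interval: 4h    # re-notify at most every 4 hours while an alert keeps firing

inhibit_rules:
  # A cluster-wide "high error rate" alert suppresses the per-service error alerts
  # in the same cluster (alert names are assumptions).
  - source_matchers:
      - 'alertname="ClusterHighErrorRate"'
    target_matchers:
      - 'alertname="ServiceHighErrorRate"'
    equal: ['cluster']
```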
Incident Response: From Detection to Postmortem as a Process Chain
An incident is more than a loud alert. To remain NIS-2/DORA-compliant and operationally manageable, we view Incident Response as a chain of six phases:
- Detection
- Triage
- Investigation
- Mitigation
- Resolution
- Postmortem
1. Detection: From Alert to Incident
Not every alert becomes an incident. Detection means:
- An alert with a certain severity level is triggered (e.g., by VictoriaMetrics / Grafana Alerting).
- The on-call person briefly assesses: Is it a real problem or expected behavior?
- Above a defined severity level, an incident is opened, typically as an issue in the GitLab project “Incidents” (see the sketch below for one way to automate this).
This immediately creates:
- a unique Incident ID,
- a central communication and documentation location,
- and the basis for later reports (NIS-2/DORA).
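One way to wire this up: a dedicated Alertmanager receiver forwards major and critical alerts to a small internal service that opens the issue in the GitLab project “Incidents” via the GitLab Issues API. The endpoint and receiver name below are assumptions, and the bot itself is not shown:

```yaml
# Excerpt from the Alertmanager configuration (endpoint and receiver name are assumptions).
receivers:
  - name: incident-bot
    webhook_configs:
      - url: 'https://incident-bot.internal.example.org/alertmanager'
        send_resolved: true   # also notify the bot when the alert resolves

route:
  routes:
    # Major and critical alerts additionally go to the incident bot,
    # which creates the GitLab issue and returns the Incident ID.
    - matchers:
        - 'severity=~"major|critical"'
      receiver: incident-bot
      continue: true
```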
2. Triage: Classification and Initial Decisions
In triage, the first few minutes are used to clarify:
- Scope: Which systems/customers are affected?
- Impact: Loss of functionality? Data security at risk?
- Severity: Which predefined level applies?
- Reporting Obligation: Potentially NIS-2/DORA relevant?
Typical activities:
- Quick glance at dashboards in Grafana.
- Check status pages and synthetic monitoring.
- Communicate with support/business to assess external impact.
The result is documented in the GitLab issue: time, initial assessment, severity, whether regulatory relevant.
3. Investigation: Understanding Causes
In the investigation phase, you use forensics tools to find the cause:
- VictoriaLogs as a central log platform for application, infrastructure, and security logs.
- Correlation of metrics (VictoriaMetrics) and logs (VictoriaLogs), for example: Increasing error rate correlates with specific exceptions or deployment times.
- Review of configuration and secret changes; clean secrets management via the External Secrets Operator (ESO) helps here, because changes are versioned and traceable.
Investigation goal:
- Root Cause Hypothesis: What actually went wrong?
- Exploit vs. Misconfiguration: Security incident or operational error?
- Risk Assessment: Is data integrity or confidentiality affected?
4. Mitigation: Limiting Damage
Mitigation means reducing impact, even if the root cause is not fully resolved. Typical measures:
- Rollback to a known, working version.
- Temporarily disable individual features.
- Increase capacities to absorb load peaks.
- More restrictive security rules (e.g., firewalls, RBAC adjustments).
All measures should be:
- documented in the GitLab issue: time, responsible person, expected effect, actual effect;
- visible in monitoring & logging to verify their effect.
5. Resolution: Restoration and Closure
Resolution is achieved when:
- Service levels are back in the green (evidenced by metrics in VictoriaMetrics and Grafana);
- the root cause is resolved, not just temporarily patched;
- all temporary mitigation measures are either cleanly integrated or rolled back.
The incident is marked with a clear status (e.g., “Resolved”) in GitLab, and all relevant times are documented:
- Start of the incident
- Start of mitigation
- Recovery time
- Final resolution
These timestamps are invaluable later – both for SRE analyses and NIS-2/DORA reports.
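To keep these timestamps consistent and easy to extract, both for SRE metrics such as MTTR and for the regulatory reports, it can help to record them in a fixed block inside the GitLab issue. The field names and values below are our own convention with example values, not a GitLab feature:

```yaml
# Incident timeline block pasted into the GitLab issue (own convention, example values).
incident_id: "INC-2025-042"
severity: critical
detected_at:  "2025-03-04T14:05Z"   # first alert fired
triaged_at:   "2025-03-04T14:12Z"   # severity and reporting relevance assessed
mitigated_at: "2025-03-04T14:40Z"   # user impact reduced (e.g., rollback)
recovered_at: "2025-03-04T15:10Z"   # service levels back within SLO
resolved_at:  "2025-03-05T09:30Z"   # root cause fixed, mitigations cleaned up
reportable:   "nis2-review"         # outcome of the compliance assessment
```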
6. Postmortem: Learning and Compliance Reports
Postmortems are not blame documents but learning tools. They are also an excellent basis for:
- the 72-hour report (initial assessment)
- and the 30-day final report under NIS-2 / DORA.
Structured postmortems can be well maintained in the GitLab Wiki, linked to the incident issue. Typical contents:
- Chronology of the incident
- Technical root cause analysis
- Decision points and alternatives
- Impact on users and business
- Lessons learned and concrete follow-up tasks
- Assessment of whether and how NIS-2/DORA reporting obligations were met
Practical Example: “High Error Rate” – From Alert to Final Report
Let’s take a concrete scenario:
Detection: High Error Rate
An alerting rule in VictoriaMetrics triggers (a sketch of the rule follows after the list):
- Condition: HTTP 5xx rate of a core service >5% over 10 minutes
- Severity: Major
- Routing: on-call SRE team via pager, additionally incident channel in chat
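Expressed as a rule, this could look roughly like the following (Prometheus/vmalert syntax; metric and label names as well as the dashboard link are assumptions based on the scenario):

```yaml
groups:
  - name: payment-service
    rules:
      - alert: PaymentHighErrorRate
        # Share of HTTP 5xx responses in the payment service above 5% over 10 minutes.
        expr: |
          sum(rate(http_requests_total{service="payment", code=~"5.."}[10m]))
            /
          sum(rate(http_requests_total{service="payment"}[10m])) > 0.05
        for: 10m
        labels:
          severity: major
          team: sre
        annotations:
          summary: "High error rate in payment service (>5% 5xx over 10 minutes)"
          dashboard: "https://grafana.example.org/d/payment"   # assumed panel link
```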
The on-call person opens an incident issue in GitLab with the title “High Error Rate in Payment Service”, a timestamp, and a reference to the corresponding Grafana panel.
Triage: Severity and Reporting Obligation
In the first 15 minutes:
- Grafana shows that only a portion of requests is affected, but all checkout flows are impacted.
- Support reports increased error rates in customer tickets.
- There are no indications of data leakage, but revenue is potentially being lost.
Result:
- Severity is raised to Critical.
- It is marked: “Potentially NIS-2 relevant”, as the availability of a critical service is impaired.
- Security/Compliance are added to the GitLab issue.
Investigation: Metrics + Logs
The SREs use VictoriaLogs:
- Filtering for errors in the payment service in the last 30 minutes.
- Correlation with deployments: Shortly before the problem occurred, there was a new release.
- Error pattern shows a specific exception in the connection to an upstream system.
Hypothesis: Misconfiguration in