Announcing SLO alerting
Today we are extremely excited to announce support for Service Level Objective (SLO) alerts in Capture. SLO alerts are an industry best practice for operating reliable systems and they are now available as part of the bitdrift Capture dynamic observability system.

SLOs are a method of applying rigor to organizational reliability objectives. An example of an SLO is: 99.9% of app start Time To First Interaction (TTFI) measurements should be less than 5s. SLOs operate over windows (most commonly 7 or 30 days) and effectively allow for a certain amount of failure as part of normal operation. This allowed failure is known as the error budget. For a much deeper dive on what SLOs are and how to use them, see the chapter on implementing SLOs in the Google SRE book.
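To make the error budget concrete, here is a minimal Python sketch of the arithmetic. The function names and the event counts are hypothetical and chosen only for illustration; they are not part of Capture.

```python
def allowed_bad_events(slo_target: float, total_events: int) -> int:
    """Bad events the SLO tolerates over the window (the error budget)."""
    # round() guards against floating-point noise in (1 - slo_target).
    return round(total_events * (1 - slo_target))


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is violated."""
    budget = total_events * (1 - slo_target)
    if budget == 0:
        return 0.0
    bad_events = total_events - good_events
    return 1 - bad_events / budget


# Hypothetical 30 day window: 1,000,000 app starts, 600 with a TTFI of 5s or more.
print(allowed_bad_events(0.999, 1_000_000))
print(round(budget_remaining(0.999, 999_400, 1_000_000), 3))
```

With a 99.9% target, one million app starts leave a budget of 1,000 slow starts, so 600 slow starts consume 60% of the budget and leave 40%.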
Once SLOs are defined, they need to be alerted on. This may sound simple, but it is deceptively complicated, as evidenced by the length of the accompanying chapter on alerting on SLOs in the Google SRE book.
Up to now, Capture has only supported “basic” alerts on synthetic metric time series data. For example, alert when a synthetic counter is greater than 100 for 10 consecutive aggregation periods. With this release, it is now possible to define and then alert on SLOs within Capture. The following functionality is supported:
- 7 or 30 day SLOs.
- SLOs are supported on synthetic rate charts where the rate is assumed to be good events / total events. For example, the rate of TTFI < 5s as described in the example above.
- We have implemented Multiwindow, Multi-Burn-Rate Alerts (MWMBR) as defined in part 6 of the Google SRE book chapter on alerting on SLOs. This is the optimal method of alerting as it balances good precision (what proportion of alerts are significant) with good recall (what proportion of significant events are detected).
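
As a rough illustration of how multiwindow, multi-burn-rate alerting works, here is a minimal Python sketch. The burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends exactly the whole budget over the SLO window. The thresholds below (14.4 over 1 hour and 5 minutes, 6 over 6 hours and 30 minutes, 1 over 3 days and 6 hours) are the example values the Google chapter gives for a 30 day SLO. This post does not state which thresholds Capture uses internally, so treat the rule table, the class, and the function names as assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window_minutes: int
    short_window_minutes: int
    threshold: float
    severity: str


# Example thresholds for a 30 day SLO (assumed here, not Capture's actual configuration).
RULES_30_DAY = (
    BurnRateRule(60, 5, 14.4, "page"),       # 2% of the budget spent in 1 hour
    BurnRateRule(360, 30, 6.0, "page"),      # 5% of the budget spent in 6 hours
    BurnRateRule(4320, 360, 1.0, "ticket"),  # 10% of the budget spent in 3 days
)


def burn_rate(good: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0
    return (1 - good / total) / (1 - slo_target)


def firing_rules(window_counts, slo_target, rules=RULES_30_DAY):
    """Return the rules whose long AND short windows both exceed the threshold.

    window_counts maps a window length in minutes to a (good, total) tuple.
    """
    fired = []
    for rule in rules:
        long_rate = burn_rate(*window_counts[rule.long_window_minutes], slo_target)
        short_rate = burn_rate(*window_counts[rule.short_window_minutes], slo_target)
        # The short window confirms that the problem is still happening right now.
        if long_rate > rule.threshold and short_rate > rule.threshold:
            fired.append(rule)
    return fired


# Hypothetical ongoing incident: about 2% of app starts are slow in every recent window.
ongoing = {5: (980, 1000), 60: (11800, 12000), 30: (5950, 6000),
           360: (71500, 72000), 4320: (863000, 864000)}
print([rule.severity for rule in firing_rules(ongoing, 0.999)])
```

Requiring both windows to exceed the threshold is what provides the balance described above: the long window filters out brief blips (precision), while the short window lets the alert fire quickly during a real incident and reset soon after the incident ends.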