I modeled traffic-weighted SLOs as probability chains in PromQL

1 points by lep_qq 107 days ago

Most SLO tools treat a user journey as binary: either all services are up, or the whole thing is down. That breaks when traffic doesn't flow uniformly through all your services.

The checkout SLO that lied

Three services: checkout-base (99.9%), payments (99.95%), coupon (99.5%). A naive AND composition multiplies all three (0.999 × 0.9995 × 0.995) and gives a system SLO of ~99.35%.

But 90% of users never hit the coupon service. Only 10% go through base → coupon → payments. The coupon service drags the number down, but it only affects a tenth of my traffic.

The correct formula is:

    e_total = 1 - (
        0.9 × (1 - e_base) × (1 - e_payments)
      + 0.1 × (1 - e_base) × (1 - e_coupon) × (1 - e_payments)
    )

Each route is a chain where all services must succeed (multiply success rates). Weights represent traffic share and must sum to 1.
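To check the arithmetic, here is a minimal Python sketch that plugs the three availability figures from the post into both compositions. It assumes the quoted percentages can be read directly as per-service success rates; the variable and function names are made up for illustration.

```python
# Availability figures from the post, read as per-service success rates.
success = {"base": 0.999, "payments": 0.9995, "coupon": 0.995}

# Routes: (traffic weight, chain of services that must all succeed).
routes = [
    (0.9, ["base", "payments"]),
    (0.1, ["base", "coupon", "payments"]),
]


def chain_success(chain):
    """Multiply the success rates of every service in a chain."""
    result = 1.0
    for service in chain:
        result *= success[service]
    return result


# Naive AND: every request is assumed to touch all three services.
naive = chain_success(["base", "coupon", "payments"])

# Weighted routes: each chain counts in proportion to its traffic share.
weighted = sum(weight * chain_success(chain) for weight, chain in routes)
e_total = 1 - weighted

print(f"naive    = {naive:.4%}")
print(f"weighted = {weighted:.4%}")
print(f"e_total  = {e_total:.4%}")
```

With these inputs the naive composition lands at roughly 99.35%, while the weighted one lands at roughly 99.80%, so the coupon service costs the journey far less than the AND model suggests.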

Translating this into PromQL

PromQL has no native "product of a set" operator. What it does have is scalar(), which collapses a single-element vector into a scalar. That is exactly what you need when each slok:sli_error_rate recording rule returns one value.

The generated rule for a 5m window:

    1 - (
      0.9 * (
        (1 - scalar(slok:sli_error_rate:5m{slo_name="checkout-base-slo",...}))
        * (1 - scalar(slok:sli_error_rate:5m{slo_name="payments-slo",...}))
      )
      +
      0.1 * (
        (1 - scalar(slok:sli_error_rate:5m{slo_name="checkout-base-slo",...}))
        * (1 - scalar(slok:sli_error_rate:5m{slo_name="coupon-slo",...}))
        * (1 - scalar(slok:sli_error_rate:5m{slo_name="payments-slo",...}))
      )
    )

scalar() is load-bearing. Without it you'd be multiplying labeled vectors with different label sets: PromQL would try to match them on their labels, find no matching pairs, and come back empty instead of producing a number.
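The shape of the rule above is regular enough to build mechanically from the route list. The following Python sketch is one way that could look; it is not slok's actual generator, and selector(), route_expr() and composition_expr() are hypothetical names. The post elides the remaining label matchers ("..."), so the sketch emits only slo_name and leaves a placeholder parameter for the rest.

```python
def selector(metric, slo_name, extra_matchers=""):
    """Build a selector; extra_matchers stands in for the matchers the post elides."""
    matchers = f'slo_name="{slo_name}"'
    if extra_matchers:
        matchers += "," + extra_matchers
    return f"{metric}{{{matchers}}}"


def route_expr(weight, chain, window):
    """One route: weight times the product of (1 - error rate) over its chain."""
    metric = f"slok:sli_error_rate:{window}"
    factors = [f"(1 - scalar({selector(metric, slo)}))" for slo in chain]
    return f"{weight} * ({' * '.join(factors)})"


def composition_expr(routes, window):
    """Composed error rate: 1 minus the weighted sum of route success rates."""
    terms = [route_expr(weight, chain, window) for weight, chain in routes]
    return f"1 - ({' + '.join(terms)})"


routes = [
    (0.9, ["checkout-base-slo", "payments-slo"]),
    (0.1, ["checkout-base-slo", "coupon-slo", "payments-slo"]),
]
expr = composition_expr(routes, "5m")
print(expr)
```

Calling composition_expr() once per window string would give one expression per evaluation window.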

The rule is generated for each evaluation window (5m, 1h, 6h, 3d, 7d, 30d) and stored as slok:sli_error_composition_rate:WINDOW. Everything downstream — burn rate, alerts, status — consumes this single metric without knowing how it was produced.
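To illustrate why a single composed metric is enough downstream, here is a hedged Python sketch of a multi-window burn rate check that reads nothing but the composed error rate per window. The 14.4 threshold and the 1h/5m pairing are taken from the common Google SRE Workbook example and are assumptions here, not values confirmed for slok; the sample error rates are invented, and 99.9 is the target from the post's spec.

```python
def burn_rate(error_rate, target_percent):
    """Error rate divided by the error budget implied by the target."""
    budget = 1 - target_percent / 100
    return error_rate / budget


def should_page(composed, target_percent, long_window, short_window, threshold):
    """Multi-window check: both windows must burn faster than the threshold."""
    return (
        burn_rate(composed[long_window], target_percent) > threshold
        and burn_rate(composed[short_window], target_percent) > threshold
    )


# Hypothetical samples of slok:sli_error_composition_rate:WINDOW.
composed = {"5m": 0.02, "1h": 0.016}

page = should_page(composed, 99.9, "1h", "5m", threshold=14.4)
print(page)
```

Nothing in the check knows about routes, weights, or individual services, which is the point of hiding the composition behind one recording rule.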

The YAML interface

    kind: SLOComposition
    spec:
      target: 99.9
      window: 30d
      objectives:
        - name: base
          ref: { name: checkout-base-slo }
        - name: payments
          ref: { name: payments-slo }
        - name: coupon
          ref: { name: coupon-slo }
      composition:
        type: WEIGHTED_ROUTES
        params:
          routes:
            - name: no-coupon
              weight: 0.9
              chain: [base, payments]
            - name: with-coupon
              weight: 0.1
              chain: [base, coupon, payments]

With the composed error rate as a recording rule, the standard multi-window burn rate pipeline works unchanged. The alert fires when the composed journey is burning budget too fast: not when any single service degrades, but when the degradation actually impacts users at the rate the weights predict.
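Going back to the spec itself, a few of its constraints are easy to check mechanically: route weights must sum to 1, objective aliases should be unique, and every alias in a chain should name a declared objective. The Python sketch below shows one possible set of checks, with the spec hand-copied into plain dicts. validate_composition() and its messages are hypothetical and do not describe what slok's webhook does.

```python
import math


def validate_composition(objectives, routes):
    """Return a list of problems found in a WEIGHTED_ROUTES spec."""
    problems = []

    # Objective aliases should be unique, or a chain entry is ambiguous.
    names = [obj["name"] for obj in objectives]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        problems.append(f"duplicate objective aliases: {duplicates}")

    # Weights are traffic shares, so they must sum to 1.
    total = sum(route["weight"] for route in routes)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"route weights sum to {total}, expected 1")

    # Every alias used in a chain must be a declared objective.
    known = set(names)
    for route in routes:
        unknown = [alias for alias in route["chain"] if alias not in known]
        if unknown:
            problems.append(f"route {route['name']!r} uses undeclared aliases: {unknown}")

    return problems


objectives = [{"name": "base"}, {"name": "payments"}, {"name": "coupon"}]
routes = [
    {"name": "no-coupon", "weight": 0.9, "chain": ["base", "payments"]},
    {"name": "with-coupon", "weight": 0.1, "chain": ["base", "coupon", "payments"]},
]
problems = validate_composition(objectives, routes)
print(problems)
```

Returning a list of problems rather than raising on the first one lets a caller report everything wrong with a spec in a single pass.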

Limitations

scalar() assumes each input recording rule returns exactly one series. If a query matches zero or multiple series, scalar() returns NaN and the composition breaks silently. Also: duplicate alias detection in routes isn't enforced yet by the webhook.
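The silent failure is easy to reproduce outside Prometheus. This Python sketch mimics the documented behaviour of scalar() (the single sample's value, otherwise NaN) and feeds it into the post's two-route formula. The function names and sample values are invented for illustration.

```python
import math


def scalar(samples):
    """Mimic PromQL scalar(): the lone sample's value, else NaN."""
    if len(samples) == 1:
        return samples[0]
    return math.nan


def composed_error(base, payments, coupon):
    """The post's two-route formula, fed with lists of instant-vector samples."""
    b, p, c = scalar(base), scalar(payments), scalar(coupon)
    return 1 - (0.9 * (1 - b) * (1 - p) + 0.1 * (1 - b) * (1 - c) * (1 - p))


healthy = composed_error([0.001], [0.0005], [0.005])
# The coupon selector matches two series here, so scalar() yields NaN.
broken = composed_error([0.001], [0.0005], [0.005, 0.007])

print(healthy, broken)
```

One NaN input is enough to turn the whole composition into NaN, even though the coupon route carries only a tenth of the weight. In this sketch any threshold comparison against that NaN is false, which is one way such a failure can go unnoticed.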

This is alpha. Feedback on the API shape welcome.

Repo: https://github.com/federicolepera/slok