
25+ Site Reliability Engineering OKRs

Read this before reviewing the Site Reliability OKRs below

Please review these guidelines before you consider adapting the OKRs:

  • Many of the OKRs are ambitious examples – certainly more than what most junior SREs should be given or could handle
  • Most OKRs would be the culmination of efforts by an entire SRE team and not a sole engineer
  • Numbers in the OKRs, e.g. 0.75%, have been created for illustrative purposes only – consider your metrics and goals for the numbers

Incident Response OKRs

  • Reduce MTTR for on-call engineers by 5%
  • Develop buffers to ensure incidents consume < 75% of the error budget
  • Reduce false-positive system alerts to cut on-call staff costs
  • Speed up the resolution of critical incidents by 5%
  • Increase the coverage of 4-point SLIs from 90% of services to 100%
  • Reduce manual toil from 25% of responder time to 20%
  • Reduce incident recurrence from 8 out of 10 to 6 out of 10 incidents
  • Assure realistic SLA targets in line with current SLIs for > 97.5% of accounts
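
As a rough illustration of how two of the key results above might be measured, the sketch below computes MTTR from incident records and the fraction of an error budget consumed in a window. The `Incident` record shape, the function names, and the 99.9% SLO used in the usage note are assumptions made for illustration, not values or tooling from this article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape; adapt it to your incident tracker's export.
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average of (resolved_at - detected_at) across the given incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


def error_budget_consumed(slo_target: float, total_requests: int,
                          failed_requests: int) -> float:
    """Fraction of the error budget used in the window (1.0 = fully spent).

    Assumes a request-based SLO, e.g. slo_target=0.999 for 99.9% success.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO target leaves no error budget")
    return failed_requests / allowed_failures
```

For example, with an assumed 99.9% SLO, 1,000,000 requests and 600 failures, `error_budget_consumed(0.999, 1_000_000, 600)` comes out at roughly 0.6, which would sit under the 75% threshold in the key result above.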

System performance and resilience OKRs

  • Reduce 5xx (server) errors from 1% down to 0.75%
  • Increase the share of microservices designed for failover from the current 60% to 65%
  • Reduce network latency among the top 5 services by 2.5%
  • Increase average load speed of application by 0.25%
  • Reduce open-source-software-related errors by 10%
  • Reduce incident recurrence from 8 out of 10 to 6 out of 10 incidents
  • Increase black swan event awareness among developers to 90%
  • Plan for handling unexpected high demand up to 25% burst capacity
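
To make a few of these targets concrete, here is a minimal sketch of how the server error rate (HTTP 500-599 responses), a relative change between two measurements, and a burst-capacity check could be computed. All function names and the sample figures in the usage note are hypothetical; a real setup would pull these numbers from your monitoring system.

```python
def server_error_rate(status_codes: list[int]) -> float:
    """Share of responses whose HTTP status is in the 500-599 range."""
    if not status_codes:
        raise ValueError("need at least one response")
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def relative_change(before: float, after: float) -> float:
    """Signed relative change, e.g. -0.25 means a 25% reduction."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before


def burst_headroom_ok(baseline_rps: float, provisioned_rps: float,
                      burst_fraction: float = 0.25) -> bool:
    """True if provisioned capacity covers baseline plus the burst fraction."""
    return provisioned_rps >= baseline_rps * (1 + burst_fraction)
```

Note that moving an error rate from 1% to 0.75% is a 25% relative reduction (`relative_change(1.0, 0.75)` gives `-0.25`), even though the absolute drop is only a quarter of a percentage point. It is worth stating in the OKR which of the two is meant.
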

Developer support OKRs

  • Drive rail-guided services from 40% to 50% of all new launches
  • Speed up time to production for images by 20%
  • Improve developer speed-to-publish by 10%
  • Increase tool efficiency to < 2 same-purpose tools per category across teams

DevSecOps OKRs

  • Reduce build security issues by 25%
  • Drive DevSecOps awareness among developers to 75% of the headcount
  • Drive security of database architecture with < 1 major incident per year

FinOps (Cloud Cost Control) OKRs

  • Reduce the cost of stateful storage capacity by 10%
  • Reduce total cloud billing by 1%
  • Reduce vendor-based tool costs by 10%
  • Reduce routine downtime maintenance costs by 3%

Work practices OKRs

  • Increase increment velocity in SRE project work with one-sprint reduction
  • Reduce operational work from 65% of total work time to 55%
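
A key result like the operational-work share only works if the baseline is measured consistently. The sketch below shows one possible way to derive it from logged hours per work category; the category names and the choice of which ones count as operational are assumptions for illustration.

```python
# Hypothetical set of categories treated as operational work.
OPERATIONAL = frozenset({"on-call", "tickets", "manual-ops"})


def operational_share(hours_by_category: dict[str, float],
                      operational: frozenset[str] = OPERATIONAL) -> float:
    """Fraction of total logged hours spent in operational categories."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    ops = sum(h for cat, h in hours_by_category.items() if cat in operational)
    return ops / total
```

For instance, a week logged as 20 hours on-call, 6 hours of tickets, and 14 hours of project work gives a share of 0.65, matching the 65% starting point in the example above.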

Feel free to reach out if you have any questions about the above OKRs or want us to add a new OKR.

Ash Patel