PagerDuty Summit 2020

Sep. 24, 2020 · 5 min read

40+ hours of sessions and keynotes – How much time do you have?

Sessions

Incident Analysis: Your Organization’s Secret Weapon

Nora Jones, Co-founder, Jeli

Successful organizations don’t just react to incidents—they use them to learn how to be proactive. To build a stronger system, safeguarded from making the same mistakes again. Extensive research has mapped out Incident Analysis as a strategy to help teams learn, adapt, and prioritize by focusing on what matters. More focused efforts help us move faster. And farther. Which is why incident analysis could be your organization’s secret weapon. It could change the game, and keep you on top of yours. So, are you ready to turn your “failures” into success?

The incident:

  • is a catalyst to understanding where you need to improve your sociotechnical system
  • is a catalyst to showing you what your organization is good at, and what needs improvement
  • helps you understand the difference between
    • how you structure your on-call teams in theory vs.
    • how you structure them in practice

Good incident analysis can help you with:

  • Headcount
  • Training
  • Unlocking tribal knowledge
  • Quantifying how much coordination efforts during incidents cost
  • Understanding bottlenecks
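
One rough way to put a number on coordination cost is to multiply the people involved by the time they spent coordinating. The sketch below is an illustration only: the `Incident` fields, the sample incidents, and the hourly rate are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    name: str
    responders: int            # people pulled into the incident
    coordination_hours: float  # hours spent on calls, chat, and handoffs


def coordination_cost(incidents, hourly_rate):
    """Return (person-hours, estimated cost) spent on coordination."""
    person_hours = sum(i.responders * i.coordination_hours for i in incidents)
    return person_hours, person_hours * hourly_rate


# Hypothetical sample data.
incidents = [
    Incident("checkout-latency", responders=4, coordination_hours=1.5),
    Incident("login-outage", responders=6, coordination_hours=2.0),
]
person_hours, cost = coordination_cost(incidents, hourly_rate=100.0)
print(person_hours, cost)
```

A number like this is only a starting point; it says nothing about why the coordination was needed, which is what the analysis itself is for.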

What can you do today to improve incident analysis?

  • Give folks time and space to get better at analysis. It is a skill that can be trained
  • Come up with different metrics: Look at the people
  • Investigator on-call rotations
  • Allow time to investigate the “big ones”

How do you know if it’s working?

  • More folks are (voluntarily) reading the incident review
  • More folks are attending the incident review
  • You’re not seeing the same folks (i.e., heroes) pop into every incident
  • Folks feel more confident
  • Teams are collaborating more
  • Better shared understanding of the definition of an incident
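
The "same folks in every incident" signal can be checked against responder lists if you keep them. This is a minimal sketch, assuming each incident is recorded as a list of responder names; the names, the data, and the 0.75 threshold are made up for illustration.

```python
from collections import Counter


def responder_share(incidents):
    """Map each responder to the fraction of incidents they joined."""
    total = len(incidents)
    if total == 0:
        return {}
    # set() so a person listed twice on one incident is counted once.
    counts = Counter(p for responders in incidents for p in set(responders))
    return {person: n / total for person, n in counts.items()}


def likely_heroes(incidents, threshold=0.75):
    """Responders who appear in at least `threshold` of all incidents."""
    shares = responder_share(incidents)
    return sorted(p for p, share in shares.items() if share >= threshold)


# Hypothetical data: one list of responder names per incident.
incidents = [
    ["ana", "raj"],
    ["ana", "li"],
    ["ana", "raj", "sam"],
    ["ana"],
]
print(likely_heroes(incidents))
```

If the list of likely heroes shrinks over time, knowledge is probably spreading across the team.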

CI/CD and the Rise of CV

Casey Rosenthal, Co-Founder, Verica

Like CI/CD, Continuous Verification is born out of a need to navigate increasingly complex systems. Modern organizations can’t validate that the internal machinations of the system work as intended, so instead they verify that the output of the system matches expectations.

Continuous Verification is a proactive experimentation tool for verifying system behavior.
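
The idea of verifying output rather than validating internals can be sketched as a black-box check. This is not Verica's tooling, only an illustration: `verify`, `price_with_tax`, and the sample cases are hypothetical, and a real setup would run such checks repeatedly against a live system.

```python
def verify(system, cases):
    """Run each input through the system and collect mismatched outputs."""
    failures = []
    for given, expected in cases:
        actual = system(given)
        if actual != expected:
            failures.append((given, expected, actual))
    return failures


# Hypothetical system under verification: we only observe its output.
def price_with_tax(cents):
    # 8% tax, rounded to the nearest cent using integer arithmetic.
    return (cents * 108 + 50) // 100


# Expectations stated as (input, expected output) pairs.
cases = [(100, 108), (250, 270), (0, 0)]
failures = verify(price_with_tax, cases)
print("verified" if not failures else failures)
```

The check never looks inside `price_with_tax`; it only compares what comes out with what was expected.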

Myths in complex system design and operations. The belief is that you improve your availability by:

  • Myth 1: Removing the people who cause accidents (i.e. the “bad apple” concept)
  • Myth 2: Documenting best practices and runbooks
  • Myth 3: Defending against prior root causes (“at best, you’re wasting your time if you’re doing RCA”)
  • Myth 4: Enforcing procedures (“confusing management’s notion of work as idealized, with the engineer’s notion of work actually as done”)
  • Myth 5: Avoiding risk
  • Myth 6: Simplifying a complex system
  • Myth 7: Adding redundancy

Chaos Engineering is the facilitation of experiments to uncover systemic weakness.

https://principlesofchaos.org/
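
A chaos experiment in miniature: state a steady-state hypothesis, inject a failure, and see whether the hypothesis still holds. The sketch below is a toy model, not a real tool; the replica dictionary, the `min_healthy` rule, and the function names are assumptions made for illustration.

```python
def steady_state(replicas, min_healthy=2):
    """Hypothesis: the service is fine while enough replicas are up."""
    return sum(replicas.values()) >= min_healthy


def run_experiment(replicas, victim):
    """Check steady state, inject one failure, check again, then restore."""
    if not steady_state(replicas):
        return "aborted: not in steady state before the experiment"
    original = replicas[victim]
    replicas[victim] = False         # inject the failure
    try:
        held = steady_state(replicas)
    finally:
        replicas[victim] = original  # always roll back the injection
    return "hypothesis held" if held else "weakness found"


# Hypothetical service with three healthy replicas.
replicas = {"a": True, "b": True, "c": True}
print(run_experiment(replicas, "a"))
```

With only two replicas the same experiment reports a weakness, which is the kind of finding the experiment exists to surface.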

Building Resilient Engineers

Brian Rutkin, Senior Staff SRE, Twitter

It’s not bad or wrong for a human to need to take action; we just want to reduce the burden of taking action to the safest, quickest, and easiest solution possible and get back to the system taking care of itself.

Communication:

  • Get better at communication and everything gets easier

Simplify:

  • Make answers obvious/easy to find
    • owners, escalation path, documentation, dashboards, error index
  • Make answers easy to understand
    • ensure a 3 a.m., sleep-addled brain can understand the problem
  • Bias answers toward action
    • know what steps to take to mitigate
  • Answers should empower individuals to have agency
    • escalation means you have more work to do

On-call and Incident Preparation:

  • Runbook
  • Training/Onboarding/Shadowing
  • Roles during incident (active troubleshooter, communications, support)
  • Escalation agreement
  • Practice dealing with incidents (game days, chaos testing, failovers, load testing)

Fatigue and Burnout:

  • Each stressor that we bear reduces the energy we have left to apply directly to our work
  • A lack of agency or the ability to bring about change
  • Having an unclear mission or unclear directive/goal
  • Invisible stressors

Building Resilience:

  • (Business/Employee) Resource Groups
  • Allies
  • Mentorships
  • Sponsorships
  • Unconscious Bias Training
  • Focus Time
  • Self-Care

Building and Scaling SRE Teams

Tammy Bryant (Butow), Principal SRE, Gremlin

In the Heat of the Page: Coping with the Root Cause of Incident Stress

J. Paul Reed, Senior Applied Resilience Engineer, CORE Team, Netflix

Most SREs, if they’re being truly honest, will admit that being involved in an incident, either as a responder or incident commander, is… inherently stressful. But why is that? What makes incidents “stressful?” We’ll take a look at some of the underlying reasons why we experience stress when engaged in incidents.

Team Common Ground

  • Basic Compact
  • Goal Alignment and Commitment
  • Interpredictability (with other actors in the system)
  • Sustain and Repair

Reducing (cognitive) stress during incidents

  • Practice interpredictability
  • Learning to “move” through time: Practice assessing system state and “moving” between different timespans of discretion
    • Venn diagram: probable, plausible, possible, preferable

Adaptive Capacity

  • Get a sense of your current team’s capacity for adaptation
  • Understand “degrees of freedom” and constraints
  • Deliberately create space to share more stories