Astrid Atkinson
CEO/Cofounder @CamusEnergy, x-googler.

Astrid shares her thoughts on large scale system management and potential learnings for grid operators. Join the discussion on Twitter.

Note: This blog post (and twitter thread) was later expanded into a second blog post here.

We just finished sharing thoughts on large scale monitoring and telemetry with MISO, and I wanted to share a few reflections on applying lessons from internet scale systems to managing the electrical grid.

(Firstly, the terminology is awkward. We’d describe what we did at Google as “distributed systems engineering”, but in utilities every one of those words has a different and specific meaning. Distribution *grids* have systems engineers, who design physical networks – not software.)

Grids are in the process of transitioning from a model with a small number of big, reliable generators to a much larger number of less-reliable participants. This isn’t so different from the transition from a centralized to a distributed computing model.

The foundational requirement of distributed systems reliability is monitoring. In order to get multiple participants to work well together, you need to know what they’re doing – and in particular, whether the system as a whole is accomplishing its goals. Once you have a foundation which allows you to measure whether the system itself is working, it’s much easier to change it: if a change causes a problem, you can quickly identify the failure. This gives both operators and engineers more flexibility.

As you shift from a centralized model to a distributed one, reliability is less about planning and more about response. As the number of participants increases, the odds that at least one participant is failing at any given moment also increase. And failures between components interact.
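To make the scaling point concrete, here is a minimal sketch of how the chance of seeing at least one failure grows with the number of participants. It assumes each participant fails independently with the same probability, which is a simplification (interacting failures make the real picture worse), and both the function name `prob_any_failure` and the 0.1% failure rate are illustrative choices, not figures from any real grid.

```python
def prob_any_failure(n: int, p: float) -> float:
    """Probability that at least one of n participants is failed.

    Assumes independent failures with identical probability p,
    which is a simplifying assumption for illustration only.
    """
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical per-participant failure probability of 0.1%.
    for n in (10, 1_000, 100_000):
        print(f"{n:>7} participants -> {prob_any_failure(n, 0.001):.4f}")
```

With a handful of big generators, "nothing is broken right now" is the normal state. With many thousands of small participants, something is almost always broken, which is why response matters more than planning.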

At sufficient scale, behavior of the system is nondeterministic. So now, all the planning in the world is useful but not sufficient. You also need to observe the system in real time. Monitoring shifts from a nice-to-have to a critical requirement. There are basically two ways to monitor the behavior of a large system.

  1. Measure all the endpoints and infer the state of the system
  2. Measure the system and infer the state of the endpoints

The most reliable monitoring approaches do both, and cross-check the results. Most grid management today relies on the former: measure the endpoints (the generators or consumers) and then run a physics model of the grid. This begins to fall down when you have a lot of endpoints which are not perfectly measurable.

So if we want to support reliable operations of the grid with lots of participants, the first question most grid folks ask is, how do we get reliable telemetry? But a distributed systems approach asks, how do we get reliable system insight from *unreliable* telemetry?

The answer to this is cross-comparison across multiple sources. These sources mostly already exist, but aren’t usually considered together. The system-level magic is in the comparison.

Consider: if an endpoint stops reporting, did it fail?

Well, maybe. Maybe communications glitched. What does the network telemetry say? If network-level behavior is unaffected, it’s probably a comms failure. If network telemetry also shows a problem, it’s either a device or local network failure. If network-level reporting shows a failure and the device does not, then grid connectivity was probably interrupted. This is a simple example, but you get the idea. The interactions tell us more than the nodes do. To measure system-level behavior you need a system-level approach. Over time, we found that it was possible to get extremely reliable behavior from a rabble of unreliable nodes with extremely lightweight telemetry at the edges.

Interestingly, monitoring was never quite perfect – the long tail of insights always came from the operators. So while this is a dense topic, one of the big lessons is that monitoring doesn’t exist in a vacuum. It’s a tool for helping humans manage complexity. In particular, it helps us manage *change*. If you want to evolve a system, first you have to be able to see it.
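As a rough sketch of the cross-comparison idea, the endpoint example described earlier can be written as a small decision table. Everything here is hypothetical: the `Diagnosis` names, the `diagnose` function, and the reduction of each telemetry source to a single boolean are assumptions made for illustration, and a real system would weigh many more signals.

```python
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "healthy"
    COMMS_FAILURE = "probable communications failure"
    DEVICE_OR_LOCAL_NETWORK = "device or local network failure"
    GRID_CONNECTIVITY = "probable grid connectivity interruption"


def diagnose(endpoint_looks_ok: bool, network_shows_problem: bool) -> Diagnosis:
    """Cross-check one endpoint's telemetry against network-level telemetry.

    endpoint_looks_ok: the endpoint is reporting and reports no fault.
    network_shows_problem: network-level telemetry shows a problem
        in the area served by that endpoint.
    Both inputs are illustrative simplifications.
    """
    if endpoint_looks_ok and not network_shows_problem:
        return Diagnosis.HEALTHY
    if not endpoint_looks_ok and not network_shows_problem:
        # Endpoint went quiet but the network behaves normally.
        return Diagnosis.COMMS_FAILURE
    if not endpoint_looks_ok and network_shows_problem:
        # Both sources agree that something is wrong.
        return Diagnosis.DEVICE_OR_LOCAL_NETWORK
    # Network shows a failure that the device itself does not.
    return Diagnosis.GRID_CONNECTIVITY


if __name__ == "__main__":
    for endpoint_ok in (True, False):
        for network_problem in (False, True):
            result = diagnose(endpoint_ok, network_problem)
            print(endpoint_ok, network_problem, "->", result.value)
```

Neither input is trustworthy on its own; the diagnosis comes from the combination, which is the sense in which the interactions tell us more than the nodes do.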