Reimagining Network Operations: building a smarter, safer upgrade pipeline

Six months. That’s how long it would take to manually complete a full firmware upgrade cycle across the REANNZ core network. With a greatly increased number of devices to maintain following our recent national network refresh, the maths was simple and sobering: 91 evenings of after-hours work for our engineers to keep us safe from bugs and vulnerabilities.

This was an operational burden, not a technical one. And it was time to reimagine how
we worked.

Our objective

Our legacy upgrade process - once manageable - had become a bottleneck. We needed
a means to scale operations without scaling toil. The goals were clear: support network
growth, reduce the burden on our engineers, and maintain the reliability our members
trust us to provide.

Through automation, our engineers could be freed to spend their time on
innovation, confident that rote device upgrades remained safe, predictable, and
auditable. Our security posture would also improve. When vulnerabilities were
identified, we would be able to deploy firmware updates across the fleet quickly
and consistently. No delays, no hazard.

Leveraging our experience

We did not go in blind. Our national network refresh had already taught us valuable
lessons about the limits of manual oversight. During that project, we invested
considerably in developing automated state-checking tools that gave us deep visibility
into device and service health - tools that surfaced subtle issues and helped us to avoid
costly missteps.

We recently shared the story of this project with our Australian colleagues at AusNOG,
where we presented a clear message: automation in a modern carrier network is not a
luxury. It is essential.

Building the solution

We built upon our successes during our network refresh and adapted our solution to
our firmware upgrade process. The result was a robust, automated upgrade pipeline
that begins and ends with a detailed snapshot of the state of each device.

It is not sufficient simply to see the device operating on its upgraded firmware. We now
have a system to verify that every individual member service is operating to its
specification, without disruption from our work.
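
As a rough illustration of the idea, and not a description of REANNZ's actual tooling, the before-and-after comparison could be sketched in Python as below. The ServiceState record, its fields (oper_up, prefixes_received), and the diff_snapshots function are hypothetical names chosen for this example; a real snapshot would capture far more state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceState:
    """Hypothetical record of one member service as seen on a device."""
    name: str
    oper_up: bool
    prefixes_received: int


def diff_snapshots(before, after):
    """Compare two snapshots and return a list of human-readable problems.

    Each snapshot maps a service name to its ServiceState. An empty list
    means every service seen before the upgrade looks at least as healthy
    afterwards.
    """
    problems = []
    for name, pre in before.items():
        post = after.get(name)
        if post is None:
            problems.append(f"{name}: missing after upgrade")
        elif pre.oper_up and not post.oper_up:
            problems.append(f"{name}: was up, now down")
        elif post.prefixes_received < pre.prefixes_received:
            problems.append(
                f"{name}: prefixes dropped from {pre.prefixes_received} "
                f"to {post.prefixes_received}"
            )
    return problems


# Example: one service is unchanged, one has gone down.
before = {
    "member-a": ServiceState("member-a", True, 120),
    "member-b": ServiceState("member-b", True, 40),
}
after = {
    "member-a": ServiceState("member-a", True, 120),
    "member-b": ServiceState("member-b", False, 0),
}
print(diff_snapshots(before, after))
```

The point of the sketch is that the check is made per service rather than per device: a device can come back on its new firmware while an individual service on it has quietly degraded.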

If there is anything that cannot be verified, the on-call engineer is notified to remediate.
A complete audit log is kept for review by the team, supporting continuous
development of our systems.

And to prevent cascading failures, we designed a failsafe. If an upgrade demonstrates
any unexpected consequence, the system halts all further work until the issue is
identified and resolved.
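
The halt-on-failure behaviour described above might look something like the following Python sketch. It is illustrative only: run_upgrades and the upgrade, verify, and notify callables are assumed names, standing in for whatever performs the upgrade, checks device state, and pages the on-call engineer in a real system.

```python
def run_upgrades(devices, upgrade, verify, notify):
    """Upgrade devices one at a time, halting at the first unverified result.

    devices is a list of device identifiers.
    upgrade(device) performs the upgrade.
    verify(device) returns a list of problems (empty means healthy).
    notify(device, problems) alerts the on-call engineer.
    Returns an audit log as a list of (device, status) tuples.
    """
    audit_log = []
    for index, device in enumerate(devices):
        upgrade(device)
        problems = verify(device)
        if problems:
            audit_log.append((device, "halted"))
            notify(device, problems)
            # Failsafe: record the untouched devices as skipped and stop.
            audit_log.extend((d, "skipped") for d in devices[index + 1:])
            break
        audit_log.append((device, "ok"))
    return audit_log


# Example run with stand-in callables: the second device fails verification.
upgraded = []
alerts = []
log = run_upgrades(
    ["router-1", "router-2", "router-3"],
    upgrade=upgraded.append,
    verify=lambda d: ["service check failed"] if d == "router-2" else [],
    notify=lambda d, p: alerts.append((d, p)),
)
print(log)
```

In this sketch the third device is never touched once the second fails verification, and the audit log records that it was skipped rather than silently omitting it.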

Looking ahead

This is not the end of our journey. The state-checking logic we developed is already
being introduced for other operational tasks, and increasingly contributes to all of our
service orchestration.

What began as a solution to one problem has evolved into our foundation for smarter,
more resilient network operations. And we’re only just getting started.