I once heard someone say that doing a major network upgrade on a live production network is not unlike performing open heart surgery on someone while they are running a marathon. In the world of the Internet, one bad keystroke can be as dangerous as an unanticipated slip of the scalpel.
Over the past 8 months, the Serverbeach network has undergone a series of upgrades that definitely felt like open heart surgery. It’s challenging to truly do the scope of these network upgrades justice, but I’d like to attempt to give this project a bit of perspective.
We spent months designing, planning, organizing, testing, and re-testing, preparing for any issue that might come up, because there’s no such thing as an undo button on a live production network. No amount of planning can make that open heart surgery any easier. It simply means you’re more aware of the difficulties that might arise, and are suitably prepared for them.
Although datacenter networks can range from modest all the way to the Googlesque, they can all be broken into 3 layers. The Serverbeach network upgrades touched every single layer in each datacenter. We completely upgraded the top two layers, essentially transplanting the heart and all supporting muscle and valves. On top of the transplant, we also reconfigured the whole bottom layer of the network, which would equate to making every vein and artery more efficient. In the end we succeeded in completely overhauling Serverbeach’s circulatory system.
To put the scope of this project more directly, here are some of the numbers that the 5-person backbone network engineering team worked through:
Number of man-hours spent planning, coordinating, ordering, designing, meeting, preparing, decommissioning, documenting, diagramming: 1800
Number of man-hours spent in after-hours change control windows: 300
Number of man-hours spent in after-hours emergency change control windows: 18
Number of after-hours change controls performed: 16
Number of after-hours emergency change controls performed: 1
Number of kilometers/miles flown: 33,800 / 21,000
Number of hours spent in airplanes: 55
Number of network devices reconfigured locally while sitting within the datacenter: 380
Number of hardware failures that occurred within change control windows: 6
To return to the original metaphor of comparing a network upgrade to open heart surgery, a successful surgery requires more than just a surgeon. Just as a surgeon is aided by a variety of resources, so too was the backbone network engineering team aided by an incredible variety of teams within Peer1. Without this level of coordination and awesome teamwork, these upgrades would not have been possible.
Guest author Ben “Dr. Network” Kennedy is a hockey player, network engineer, and aspiring open heart surgeon who has developed a mild case of Stockholm syndrome since the 8-month network maintenance window closed.