Whether you opt for bare metal servers or virtual machines, you'll need a rock-solid distro. These are my four go-to favorites.
I simulate controlled outages to reveal hidden dependencies and harden recovery. I use tc, network namespaces, iptables/nftables, and mock services, and I automate both setup and teardown. I plan scope, log everything, ...
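
As a rough illustration of automated setup and teardown, here is a minimal Python sketch that degrades a link with tc netem inside a throwaway network namespace. The namespace name `outage-lab`, the choice of the `lo` device, the delay and loss values, and the helper names are all assumptions made for the example, not details from the article. It defaults to a dry run that only logs the commands; running them for real needs root and the iproute2 tools.

```python
import contextlib
import subprocess


def setup_commands(ns, dev, delay_ms, loss_pct):
    """Commands that create a namespace and degrade one device inside it."""
    in_ns = ["ip", "netns", "exec", ns]
    return [
        ["ip", "netns", "add", ns],
        in_ns + ["ip", "link", "set", dev, "up"],
        in_ns + ["tc", "qdisc", "add", "dev", dev, "root", "netem",
                 "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"],
    ]


def teardown_commands(ns):
    """Deleting the namespace also discards the qdisc configured inside it."""
    return [["ip", "netns", "del", ns]]


@contextlib.contextmanager
def outage(ns="outage-lab", dev="lo", delay_ms=200, loss_pct=5,
           dry_run=True, log=print):
    """Apply the fault on entry and always attempt teardown on exit."""

    def run(cmd, check):
        log(" ".join(cmd))  # log every command, even in dry-run mode
        if not dry_run:
            subprocess.run(cmd, check=check)

    try:
        for cmd in setup_commands(ns, dev, delay_ms, loss_pct):
            run(cmd, check=True)
        yield
    finally:
        # check=False so a failed setup does not mask the original error
        for cmd in teardown_commands(ns):
            run(cmd, check=False)


if __name__ == "__main__":
    with outage():
        print("fault active: run recovery checks here")
```

Wrapping the fault in a context manager means teardown is attempted even if the checks inside the `with` block raise, which keeps a failed experiment from leaving the degraded namespace behind.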