At @WalmartLabs, we were tasked with building clouds that run production workloads and handle both daily site traffic and peak holiday traffic. To give you an idea of the scale, we get about XXXXX billion hits over the holiday week, run over XXXX K applications and XXXXX K nodes, and collectively operate more than 20 production and non-production clouds. With that scale came the challenges of operating these clouds and keeping them available. We redesigned the way we monitor our clouds and analyze trends, built self-healing and automation, and leveraged Rally to performance-test the clouds. We use OneOps as the PaaS layer that manages the VM life cycle, VM auto-repair/replace, and code deployment. We manage and operate these clouds by keeping things simple.
P.S. All XXXX numbers will be filled in later for the presentation.
The message we would like to share: keep your installs, distros, and automation simple. You don’t need an army of operations engineers if you stick to the basics and maintain a clean environment.