Driven by the demand to support the world's largest particle collider, the CERN IT department decided in 2012 to radically change its approach and build an "Agile Infrastructure" centered on an OpenStack-based private cloud. Since then, the CERN cloud has grown to ~300k cores and supports not only the physics programme, but also the majority of administrative and support services.
In this 5-year perspective, we will review some of our operational war stories. Concepts that simplify day-to-day operations, such as automating or outsourcing tasks via a job scheduler/orchestrator or introducing staged rollouts to mitigate deployment risks, will be presented alongside experiences from cloud-wide campaigns, such as the handling of security vulnerabilities, the mass migration of guests due to hardware retirements, and the elimination of a physical/virtual performance gap. The solutions to puzzling issues, such as intermittent VM shutdowns or data loss on reboots, will also be unveiled.
Attendees should expect to
- get a status overview of the current architecture of the CERN OpenStack deployment;
- learn the techniques and tools that we use for daily operations and that allowed the service to scale;
- understand how we organise cloud-wide campaigns that affect several thousand users (illustrated by concrete examples, such as the rollout of security patches and the corresponding complete infrastructure restart);
- have some fun with "exotic" problems we encountered (such as being haunted by mysterious VM shutdowns or unexpected complete data loss on Cinder volumes upon instance reboot)!