One of the biggest promises of the cloud vision was that all infrastructure could be managed in a policy-driven fashion, reacting to failures and other events by automatically healing and optimising services. Most of the components required to implement such an architecture already exist, e.g.:
- Monasca: Monitoring
- Aodh: Alarming
- Congress: Policy-based governance
- Mistral: Workflow
- Senlin: Clustering
- Vitrage: Root Cause Analysis
- Watcher: Optimization
- Masakari: Compute plane HA
- Freezer-dr: DR and compute plane HA
- Heat: Orchestration
- Doctor: Fault management and maintenance for NFV
- Fault Genes (WG): Fault Classifications & Recovery Strategy
- Craton: Fleet management
However, there is not yet a clear strategy within the community for how these should all tie together.
At the Queens PTG in Denver there was a well-attended kick-off meeting for this self-healing initiative, with representation from many of the involved projects.
It was agreed that a SIG should be formed, with bi-weekly IRC meetings. One of the first deliverables planned once the SIG is officially formed (which has not happened yet, but hopefully will soon) is to collect real-world use cases for self-healing infrastructure, and for that we need as much feedback from operators as possible. The Forum is an ideal opportunity to push this forward and start planning next steps in this area for the community as a whole.
https://etherpad.openstack.org/p/self-healing-rocky-forum