Software-defined everything is a new trend. How about software-defined outage prevention and remediation?
You have your cloud up and running. You monitor it through StackLight, Zabbix, Nagios or some other tool. But what happens when one of the services becomes unresponsive or your free disk space runs low? How quickly will you be able to resolve the issue? Will you have any debugging information or logs gathered before you actually start digging into it?
We will introduce a “robosysadmin” for our production OpenStack cloud that reacts to alerts and outages and helps us reduce mean time to repair by gathering debug information and trying to fix issues automatically using predefined workflows. It’s a kind of Tier 0 support: it troubleshoots, fixes known problems, escalates to humans when necessary, and provides detailed information on what it has discovered.
Attendees will learn about:
- How we monitor our multi-DC production cloud at Symantec
- How we approached the problem of cloud auto-healing
- StackStorm and alternatives for automating prevention and remediation of outages
- OpenStack auto-healing workflows we created
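
As a rough illustration of the Tier 0 flow described above (gather debug information, try a predefined workflow, escalate to a human if that does not help), here is a minimal Python sketch. It is not the StackStorm API and not the workflows from the talk: `Alert`, `Report`, `gather_debug_info`, `handle_alert` and the check name `disk_free_low` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    check: str  # hypothetical check name, e.g. "disk_free_low"
    host: str


@dataclass
class Report:
    alert: Alert
    debug_info: List[str] = field(default_factory=list)
    fixed: bool = False
    escalated: bool = False


# A workflow receives the alert and returns True if the fix succeeded.
Workflow = Callable[[Alert], bool]


def gather_debug_info(alert: Alert) -> List[str]:
    # Placeholder: a real system would collect logs, service status, etc.
    return [f"collected diagnostics for {alert.check} on {alert.host}"]


def handle_alert(alert: Alert, workflows: Dict[str, Workflow]) -> Report:
    # Always gather debug information first, so it exists even on escalation.
    report = Report(alert=alert, debug_info=gather_debug_info(alert))
    workflow = workflows.get(alert.check)
    if workflow is not None:
        try:
            report.fixed = workflow(alert)
        except Exception as exc:
            report.debug_info.append(f"workflow failed: {exc}")
    # Unknown problem or failed fix: hand over to a human with the report.
    if not report.fixed:
        report.escalated = True
    return report


if __name__ == "__main__":
    # Assumed workflow that pretends to free disk space successfully.
    workflows = {"disk_free_low": lambda alert: True}
    print(handle_alert(Alert("disk_free_low", "node-1"), workflows))
    print(handle_alert(Alert("service_down", "node-2"), workflows))
```

The point of the sketch is the ordering: diagnostics are collected before any fix is attempted, so an escalated alert always reaches a human with context attached.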