One of the primary requirements of an enterprise customer hosted on a private cloud is guaranteed availability of their workload. Although OpenStack natively supports some forms of HA, significant gaps remain from the enterprise perspective.
Specifically, if a compute node goes down, its VMs have to be evacuated manually. And although most platforms have predictive failure detection mechanisms, there is no way to use those signals to live-migrate VMs to healthier nodes.
This proposal addresses these gaps by developing an HA framework that is both reactive and proactive. With reactive HA, we will show how the VMs running on a compute node that goes down can be restored automatically. With proactive HA, we will show how the framework triggers live migrations in case of high thermal signature, predictive hardware failures, host maintenance mode, slow application performance due to bottlenecks, and other custom triggers (Ceilometer alarms, Zabbix), thus mitigating VM downtime for enterprise workloads.
Attendees can expect the following:
- How to set up an enterprise-grade HA framework with tools like pacemaker_remote, Zabbix, and Ceilometer
- How to set up actions like evacuate and live migration based on live triggers
- How to use platform features to predict hardware errors, detect thermal signatures and apply mitigation techniques, and monitor VM performance
- How to manage the health of compute nodes
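
The split between reactive and proactive handling described above can be illustrated with a minimal Python sketch. It is not the framework's actual implementation: the trigger names, the `HostEvent` type, and the `choose_action` function are all hypothetical and chosen only to mirror the triggers listed in this abstract. The sketch shows only the decision step, assuming that a host that is already down calls for evacuation while a degraded but running host calls for live migration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    EVACUATE = "evacuate"          # reactive: the host is already down
    LIVE_MIGRATE = "live_migrate"  # proactive: the host is degraded but still up
    NONE = "none"                  # unknown trigger: take no action


# Hypothetical trigger names, modeled on the proactive triggers in the abstract.
PROACTIVE_TRIGGERS = {
    "thermal_high",
    "predicted_hw_failure",
    "maintenance_mode",
    "performance_bottleneck",
    "custom_alarm",
}


@dataclass
class HostEvent:
    host: str     # compute node the event refers to
    trigger: str  # "host_down" or one of PROACTIVE_TRIGGERS


def choose_action(event: HostEvent) -> Action:
    """Map a monitoring event to an HA action (illustrative assumption only)."""
    if event.trigger == "host_down":
        return Action.EVACUATE
    if event.trigger in PROACTIVE_TRIGGERS:
        return Action.LIVE_MIGRATE
    return Action.NONE


if __name__ == "__main__":
    for trigger in ("host_down", "thermal_high", "unknown"):
        event = HostEvent(host="compute-1", trigger=trigger)
        print(trigger, "->", choose_action(event).value)
```

In a real deployment, the chosen action would still have to be carried out, for example through the evacuate and live-migration operations that Nova exposes; that step is deliberately left out of this sketch.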