Building an auto-healing cluster in a cloud environment is never an easy job. How do we accurately detect failures that happen at different layers? How do we fence failed resources promptly to prevent further damage? How do we make recovery proceed automatically and efficiently? All of these problems need to be addressed before we can call our systems and applications auto-healing. In this presentation, we will dive deep into Senlin's health management design to show how we address these issues and fill the gap.
Attendees will learn how to build and deploy an auto-healing cluster for some typical applications in an OpenStack cloud:
(1) Choose proper metrics and events for failure detection
(2) Choose proper recovery actions and their sequence for the target cluster
(3) Build and customize your own health management policy to enable auto-healing for the target cluster
(4) Extend the auto-healing loop by cooperating with other telemetry, workflow, and event services