Real-life scenarios show that finding the root cause of a failure in a complex system can be complicated and time-consuming, leading, in the worst case, to lengthy outages. Operational real-time monitoring of the infrastructure is therefore crucial for quickly identifying and alerting on potential problems. But monitoring alone is not enough: when a failure occurs in a backend component such as memcached or a database, the configured alarms will typically trigger an avalanche of notifications across all affected services and resources. What is urgently needed is a flexible mechanism for defining and analyzing the relations between these services and resources.
In this presentation, we will show how this can be achieved with Monasca and Vitrage, two OpenStack projects working together under the umbrella of the Self-Healing SIG. We will also point out other integration points that could be used to implement fully automatic remediation.
We demonstrate how Monasca and Vitrage can be used together to detect problems proactively and prevent possible outages.
Finally, we show how implementing Healthcheck APIs can improve the observability of both OpenStack services and our solution.