Our company, NTT Communications, provides the public/hosted private cloud service "Enterprise Cloud 2.0", built with OpenStack. For cloud service providers, reducing downtime is essential, so many operation tools that support failure detection and analysis have been deployed. We have used monitoring tools (TeMIP, Zabbix, etc.) since our previous cloud service; however, failure analysis across a multi-component OpenStack-based cloud service takes a long time because the causes of failures are more complex. Our goal is to develop an effective failure analysis tool. To that end, after analyzing failure cases, we added functions such as "automatic analysis of states/logs along the service procedure flow" and "cause suggestion based on dependency learning". These improvements have helped us analyze service-down failures (instance creation failures, etc.) more quickly. In this presentation, we share our development knowledge and use cases.
- Traditional operation tools for an Enterprise public cloud service
- Challenges we faced in operating a multi-component cloud environment (especially in failure analysis)
- How these challenges can be solved. For example, we present the following solutions:
- Solution (1): Automatic analysis of states/logs along the service procedure flow, which is composed of several processes and databases
- This can solve the above use case (1) because it analyzes the related processes precisely.
- For example, this would detect a silent failure in nova-compute that corresponds to a service-down event (failure to create an instance).
- Solution (2): Cause suggestion based on dependency learning between requests, performance (CPU, memory, I/O, network, etc.), architecture, and errors
- This can solve the above use case (2) because it suggests causes by comparing the current failure with related past cases from various aspects.
- The relation between workload and errors can also help with capacity planning.
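
The two solutions above could be sketched roughly as follows. This is a minimal illustration, not the tool described in the talk. The flow order, the `StepState` fields, the feature vectors, and the past cases are all hypothetical. A plain cosine-similarity lookup stands in for the dependency learning, whose actual method the abstract does not specify.

```python
from dataclasses import dataclass
from math import sqrt

# Hypothetical, simplified flow for "create instance". The component names are
# real OpenStack services, but the order and granularity are assumptions.
CREATE_INSTANCE_FLOW = ["nova-api", "nova-scheduler", "nova-compute", "neutron-server"]


@dataclass
class StepState:
    """Observed state of one step in the flow (all fields are illustrative)."""
    process_alive: bool     # is the process running?
    heartbeat_recent: bool  # was its status record in the DB updated recently?
    error_lines: int        # number of error lines found in its log


def first_suspect(flow, states):
    """Walk the flow in order and return the first step that looks unhealthy.

    A step with a clean log is still flagged when its heartbeat is stale,
    which is how a "silent" failure shows up in this sketch.
    """
    for name in flow:
        s = states[name]
        if not s.process_alive:
            return name, "process down"
        if not s.heartbeat_recent:
            return name, "stale heartbeat (silent failure)"
        if s.error_lines > 0:
            return name, "errors in log"
    return None, "no suspect found"


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def suggest_causes(current, past_cases, top=2):
    """Rank past cases by similarity to the current feature vector."""
    ranked = sorted(past_cases, key=lambda c: cosine(current, c["features"]), reverse=True)
    return [c["cause"] for c in ranked[:top]]


# Example: nova-compute is running and logs no errors, but its heartbeat is stale.
states = {
    "nova-api": StepState(True, True, 0),
    "nova-scheduler": StepState(True, True, 0),
    "nova-compute": StepState(True, False, 0),
    "neutron-server": StepState(True, True, 3),
}
print(first_suspect(CREATE_INSTANCE_FLOW, states))

# Made-up features: [request rate, CPU, memory, I/O], each normalized to 0..1.
past_cases = [
    {"cause": "CPU exhaustion on compute host", "features": [0.8, 0.9, 0.3, 0.2]},
    {"cause": "disk I/O saturation", "features": [0.2, 0.1, 0.2, 0.95]},
    {"cause": "memory pressure", "features": [0.3, 0.2, 0.9, 0.1]},
]
print(suggest_causes([0.9, 0.95, 0.4, 0.2], past_cases))
```

Walking the flow in order means the earliest unhealthy step is reported first, so the stale nova-compute heartbeat is surfaced before the downstream log errors in neutron-server. The same feature vectors that drive the cause ranking could, under these assumptions, also be used to look at how workload relates to errors for capacity planning.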