Event Details

Please note: All times listed below are in Central Time Zone


Failure Analysis Under a Multi-Components Public Cloud Environment

Ops Tools

Our company, NTT Communications, provides the public/hosted private cloud service "Enterprise Cloud 2.0", which is built on OpenStack. For cloud service providers, reducing downtime is essential, so many operation tools that support failure detection and analysis have been deployed. We have been using monitoring tools (TeMIP, Zabbix, etc.) since our previous cloud service; however, failure analysis across a multi-component, OpenStack-based cloud service takes a long time because the causes of failures are more complex. Our goal is to develop an effective failure analysis tool. To achieve this goal, we analyzed failure cases and then added functions such as "the automatic analysis of states/logs along the service procedure flow" and "the cause suggestion based on dependency learning". These improvements have helped us analyze service outages (instance creation failures, etc.) more quickly. In this presentation, we share our development knowledge and use cases.

What can I expect to learn?

Traditional operation tools for an Enterprise public cloud service
Challenges we faced in operating a multi-component cloud environment (especially in failure analysis)
How these challenges can be solved, for example with the following solutions:

Solution (1): Automatic analysis of states/logs along the service procedure flow, which is composed of several processes and databases

- This can solve the above use case (1) because it can analyze the related processes precisely.
- For example, it would detect a silent failure in nova-compute that corresponds to a service outage (failure to create an instance).
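
As a rough illustration of the idea in Solution (1), the sketch below walks a request through an ordered service flow and reports the first component that either logged an error or logged nothing at all (a "silent" failure). It is not the tool described in the talk: the flow order, the `LogRecord` fields, and the function name `first_broken_step` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Tuple

# Hypothetical, simplified flow for "create instance". The real procedure
# flow used by the presenters is not given in the abstract.
FLOW = ("nova-api", "nova-scheduler", "nova-compute")


@dataclass
class LogRecord:
    """Assumed minimal shape of a parsed log line."""
    request_id: str
    component: str
    level: str      # e.g. "INFO" or "ERROR"
    message: str


def first_broken_step(
    records: Iterable[LogRecord],
    request_id: str,
    flow: Sequence[str] = FLOW,
) -> Optional[Tuple[str, str]]:
    """Return (component, reason) for the first failing step, or None."""
    # Group the records of this request by the component that wrote them.
    by_component = {}
    for record in records:
        if record.request_id == request_id:
            by_component.setdefault(record.component, []).append(record)

    # Follow the flow in order and stop at the first suspicious step.
    for component in flow:
        seen = by_component.get(component, [])
        if not seen:
            # No trace of the request here: a candidate silent failure.
            return (component, "silent: no log records for this request")
        errors = [r for r in seen if r.level == "ERROR"]
        if errors:
            return (component, "error: " + errors[0].message)
    return None
```

Checking the steps in flow order is what lets a missing log entry be attributed to a specific component instead of being overlooked.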

Solution (2): Cause suggestion based on learning the dependencies among requests, performance (CPU, memory, I/O, network, etc.), architecture, and errors

- This can solve the above use case (2) because it can suggest causes by comparing the failure with related past cases from various aspects.
- The relation between workload and errors can also help with capacity planning.
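
The abstract does not say how the dependency learning in Solution (2) works. As a stand-in, the sketch below uses a much simpler technique, ranking the causes of past cases by the Jaccard similarity between their symptom sets and the symptoms of a new failure. The symptom labels, the case format, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_causes(
    past_cases: Iterable[Tuple[Set[str], str]],
    symptoms: Set[str],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Rank known causes by their best-matching past case.

    past_cases holds (symptom set, cause) pairs taken from earlier,
    already-diagnosed failures (an assumed data format).
    """
    best = defaultdict(float)
    for case_symptoms, cause in past_cases:
        score = jaccard(set(case_symptoms), set(symptoms))
        best[cause] = max(best[cause], score)
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

A symptom set here could mix observations from several aspects (request type, resource metrics, error messages), which mirrors the idea of comparing a failure with past cases from various angles.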

Thursday, October 27, 4:40pm-5:20pm (2:40pm - 3:20pm UTC)

CCIB - Centre de Convencions Internacional de Barcelona - P1 - Room 112


Difficulty Level: Intermediate

Tags: Architect Enterprise Ops Operator Nova Rally UX Public Cloud

Noriko Yokoyama

Software Engineer

Noriko Yokoyama is a Software Engineer, working at NTT Communications in the cloud service department since 2015. She works with the operation engineering team and develops operation tools for NTT’s enterprise cloud service. Before that, she worked at NTT Service Evolution Laboratories for more than three years. Her research interests include big data analysis and action support...

Hirotaka Kojima

Software Engineer

Hirotaka Kojima is a Software Engineer, Cloud Service Development, at NTT Communications in Tokyo, Japan. He manages an OpenStack Nova-based cloud service (e.g., Enterprise Cloud of NTT Communications). He also has one year of experience as a system administrator for a Unix/Linux web hosting service at Verio, Inc., a subsidiary of NTT America, in Boca Raton, U.S.A. He...
