Multi Cluster monitoring & How to save ourselves during major outages


This presentation walks through actual outage cases and how we resolved them using our cluster monitoring data. We also show some useful metrics and discuss the difficulties of cluster monitoring in a large-scale environment. The scale and usage of our production clusters increase day by day, which makes it difficult to keep track of cluster status at the API, internal communication, and process layers. Once we can review cluster status on a weekly or monthly basis, we can work on each problem proactively. Cluster monitoring that detects errors across our multi-cluster environment can also draw the SREs' attention to a problem as soon as it happens. The main problems we faced in designing our Cluster Monitoring project were the huge amount of data, the complicated layer structure, and the wide variety of metrics, each with its own request rate and parameters. We created a solution aimed at finding potential failures as quickly as possible.

Reedip Banerjee
LINE Corporation
DevOps Engineer

Reedip has been working with the OpenStack community since the Mitaka release and is a Senior Software Engineer at LINE Corporation. He was previously a Senior Technical Leader at NEC India and a Senior Software Engineer at Red Hat. He has more than 13 years of experience, mostly in the storage, cloud, and telecom domains. He currently works as a Cloud Engineer on a private OpenStack cloud deployment. He has worked on several proofs of concept covering disaster recovery, cloud integration with the Internet of Things, and smart cities. Within OpenStack, he has been involved with networking (Neutron). He has spoken at OpenStack Summits.