Imagine the biggest city in Europe, where tens of millions of citizens depend on the city’s digital services working reliably during the pandemic. Doctor’s appointments and check-ups, coronavirus test results, children’s digital grade books, tax declarations, and payments: everything is online. Managing this “IT zoo” is neither simple nor cheap, and the cost of errors is very high.
Hi, I’m Nikolay Ganyushkin, CEO of MONQ Digital Lab. Today’s story is about the opportunities that MONQ AIOps (Artificial Intelligence for IT Operations, a Gartner term) solutions give to business and how they help to manage IT infrastructure and digital services more efficiently. Let’s look at this through the story of one of AIOps MONQ’s customers from the public services industry. The case illustrates how AI technologies help to solve the main IT management issues.
A few details about the client help to convey the scale of its IT infrastructure and digital services:
- Industry: Public Sector
- The number of unique users of digital services per day: >1 000 000 000
- The average number of requests for services per day: >150 000
- The number of independent business units: 35
- The volume of events and triggers processed daily in IT: up to 30 000
Issue #1. Users Notice Failures Before IT, and Incident Resolution Takes Too Long
The client’s public services portal, digital services, and employee software were working unstably. External and internal users complained that services were unavailable, but because of the complexity of the environment, IT could not ensure reliability and discovered problems only after users did. Critical incidents took more than 30 minutes to resolve, sometimes up to 1 hour.
At this stage, the main goals were to detect failures before users noticed them, reduce incident resolution time, and monitor services from the user’s point of view, for instance by centrally running automated tests that simulate user behavior in the system. Based on these tests, support engineers could look for possible errors, threats, and failures and react proactively.
Moreover, the client needed to build a hybrid monitoring system and correlate business and technical data to pinpoint exactly where a failure had occurred.
How was the issue solved?
1. Synthetic interface testing was set up
- Zabbix + Jenkins + Selenium + Allure were connected to AIOps MONQ
- 87 of the client’s information systems were connected to AIOps MONQ
- Automatic testing was configured (on average, 3–7 tests, 10 steps for each system)
- Over 7,000 metrics and over 2,600 triggers were configured
2. MONQ’s Hybrid Monitoring
- Monitoring system data and logs were connected to MONQ
- MONQ automatically groups events and assigns severity based on the potential threat of a failure
- Auto-escalation of events was set up
- Engineers began to receive automatic notifications about important events and to respond to failures in time, before users notice them
At first, the client focused on monitoring simple systems: external and internal portals. For example, it monitored an internal portal with a script that included simple authorization, checking the availability of content, searching for buttons, checking subscription services, and so on. Then it decided to monitor systems with more complex logic. For example, in public services, legal entities need an EDS [electronic digital signature], and checking that the report generation function worked correctly required inspecting the contents of the generated Word documents.
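To make the idea of a multi-step synthetic scenario more concrete, here is a minimal sketch in Python. It does not use Selenium or any MONQ API: the `FakePortal` class, the step names, and `run_scenario` are all hypothetical stand-ins, chosen so that the example runs anywhere. A real setup would replace the fake portal with browser automation.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    passed: bool
    seconds: float
    error: str = ""


class FakePortal:
    """Hypothetical stand-in for a browser session against an internal portal."""

    def __init__(self, content_available: bool = True):
        self.logged_in = False
        self.content_available = content_available

    def login(self) -> None:
        self.logged_in = True

    def check_content(self) -> None:
        if not (self.logged_in and self.content_available):
            raise RuntimeError("content is not available")

    def find_button(self) -> None:
        if not self.logged_in:
            raise RuntimeError("button not found")


def run_scenario(steps: List[Tuple[str, Callable[[], None]]]) -> List[StepResult]:
    """Run steps in order and stop at the first failure, as a real user would."""
    results: List[StepResult] = []
    for name, action in steps:
        started = time.perf_counter()
        try:
            action()
            results.append(StepResult(name, True, time.perf_counter() - started))
        except Exception as exc:
            elapsed = time.perf_counter() - started
            results.append(StepResult(name, False, elapsed, str(exc)))
            break
    return results


def portal_steps(portal: FakePortal) -> List[Tuple[str, Callable[[], None]]]:
    return [
        ("authorize", portal.login),
        ("check content", portal.check_content),
        ("find button", portal.find_button),
    ]


healthy = run_scenario(portal_steps(FakePortal()))
broken = run_scenario(portal_steps(FakePortal(content_available=False)))
```

In this sketch the healthy run passes all three steps, while the broken run stops at "check content". The failed step name and error text are the kind of detail that can be turned into a monitoring event.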
What results were achieved?
- IT began to respond to outages before users notice them
- Service availability grew to 98.5% (up 1.2 percentage points from 97.3%)
- The volume of complaints dropped eightfold (from 40% to 5%)
- The average time to resolve a failure after the implementation of MONQ dropped from 30–60 minutes to 15 minutes
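To put the availability figures in perspective, the short calculation below converts them into hours of downtime per year. It assumes the services are expected to run around the clock for 365 days, which is an assumption: the measurement window is not stated here.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 24/7 operation over a non-leap year


def downtime_hours(availability_percent: float,
                   period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of unavailability implied by an availability percentage."""
    return period_hours * (100.0 - availability_percent) / 100.0


before = round(downtime_hours(97.3), 1)  # availability before MONQ
after = round(downtime_hours(98.5), 1)   # availability after MONQ
saved = round(before - after, 1)
```

Under that assumption, 97.3% corresponds to about 236.5 hours of downtime per year and 98.5% to about 131.4 hours, a difference of roughly 105 hours.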
Issue #2. High Costs of IT Monitoring
The client operated over 100 city information systems and worked with 22 contractors, and the systems were monitored manually by 50 engineers. An expensive contract for ineffective manual checks cost over $1.5 million per year. If a failure was detected, the incident was resolved manually, which took a long time.
Monitoring costs were very high, yet efficiency was low because of the difficulty of coordinating all the disparate processes and, of course, the human factor. A contractor manually checked a large number of interfaces and services. There were about 80 information systems with five scenarios each, a total of 400 scripts to process. They were checked at least 10 times per day. In total, 4,000 manual checks were performed daily, and each check took at least 20 minutes. If a problem was found, the incident was handled manually. The engineers’ time was used ineffectively.
The main goal at this step was to reduce the cost of IT monitoring and improve incident management.
To achieve this, the client started to use the following AIOps functionality:
- Manual checks were replaced by automatic synthetic testing
- Hybrid monitoring. A single screen for handling incidents was created by connecting all monitoring systems to MONQ AIOps. Auto-clustering and event deduplication were set up as well.
- Automated incident management. Auto-escalation of events was set up, along with automatic incident registration in ITSM, team notifications, tips for engineers on resolving problems, and automation scripts for routine tasks.
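Event deduplication can be illustrated with a small sketch. This is not MONQ’s algorithm, only one common approach under stated assumptions: events with the same host and trigger are folded into one cluster as long as they keep arriving within a time window. The `Event` and `Cluster` types, the field names, and the 300-second window are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Event:
    source: str     # monitoring system that emitted the event, e.g. "zabbix"
    host: str
    trigger: str
    timestamp: int  # seconds since the start of observation


@dataclass
class Cluster:
    host: str
    trigger: str
    first_seen: int
    last_seen: int
    count: int = 1


def deduplicate(events: List[Event], window_seconds: int = 300) -> List[Cluster]:
    """Fold events sharing (host, trigger) into one cluster while they keep
    arriving within `window_seconds` of the previous one."""
    open_clusters: Dict[Tuple[str, str], Cluster] = {}
    clusters: List[Cluster] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.host, event.trigger)
        current = open_clusters.get(key)
        if current and event.timestamp - current.last_seen <= window_seconds:
            current.last_seen = event.timestamp
            current.count += 1
        else:
            current = Cluster(event.host, event.trigger,
                              event.timestamp, event.timestamp)
            open_clusters[key] = current
            clusters.append(current)
    return clusters


events = [
    Event("zabbix", "db01", "disk full", 0),
    Event("web-probe", "web01", "http 500", 30),
    Event("zabbix", "db01", "disk full", 60),
    Event("logs", "db01", "disk full", 120),
    Event("zabbix", "db01", "disk full", 1000),  # outside the window: new cluster
]
clusters = deduplicate(events)
```

Five raw events become three clusters here, and the first cluster carries a count of 3, so an engineer sees one line on the screen instead of three.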
What results have been achieved?
- Synthetic monitoring costs have decreased by … 20 times. The client has reduced the cost of user interface monitoring from $1.5 million per year to $77,000.
- Overall IT support costs have decreased by 30% because fewer engineers are needed to resolve incidents.
- The number of employees involved in monitoring processes has been halved.
- Reorganization of situation centers. Two of the three situation centers were reorganized from control rooms into product working groups that do not just watch screens but perform useful operations.
Issue #3. A Lot of Noise From Monitoring Systems, Low IT Visibility
Due to the complexity of the IT infrastructure, one information system could send several hundred notifications per day. It was not clear which of those notifications were important and needed a fast response, and which could be dealt with later. Moreover, it was unclear how an incident could affect the business. So for the client, it was very important to sort alerts from monitoring systems and automate this flow so that engineers receive only useful events that need a response.
This large use case required the following AIOps functionality:
1. Auto-clustering and deduplication. Auto-clustering and event deduplication were set up in MONQ.
2. Resource-service model that visualizes the current health of all IT systems on one screen. The resource-service model made it possible to see which service failures could critically affect the availability of digital services and to prioritize incident resolution correctly.
3. Visualization and health map. The client received advanced visualization tools. Using the dependency graph, it was possible to set up triggers and combine events.
4. Integration with ITSM. Integration with ITSM made it possible to multiply the number of created incidents. An average of 28 incidents were created automatically per day. Previously, it took one situation center employee about 20 minutes to create one incident. To handle the same volume as MONQ does, the client would need 10 employees working simultaneously each day.
5. Alerting responsible teams and criticality levels. MONQ records all problems and instantly notifies the responsible teams. The client, as the operator of the system, determined the list of responsible persons who receive notifications. The client uses various notification types, via email and messengers. It has configured alert criticality: there are 4 types of alerts, from high to low priority, with different levels of responsibility. The notification itself states its priority, so it is immediately clear how quickly a reaction is needed.
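The value of a resource-service model is easiest to see as a graph traversal: given a failed component, find every service that depends on it, directly or indirectly. The sketch below is illustrative only. The service names and the `DEPENDS_ON` structure are invented for the example and do not reflect the client’s systems or MONQ’s internal model.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical model: each node lists the nodes it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "public-portal": ["auth-service", "web-frontend"],
    "tax-declarations": ["auth-service", "report-generator"],
    "auth-service": ["db-cluster"],
    "report-generator": ["db-cluster", "file-storage"],
    "web-frontend": [],
    "db-cluster": [],
    "file-storage": [],
}


def affected_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every node that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a resource up to its consumers.
    dependents: Dict[str, List[str]] = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(node)

    seen: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


storage_impact = affected_by("file-storage", DEPENDS_ON)
database_impact = affected_by("db-cluster", DEPENDS_ON)
```

In this toy model a storage failure touches only the report generator and tax declarations, while a database failure reaches every user-facing service. That difference is what lets a team prioritize one incident over another.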
In MONQ AIOps, a resource-service model can be created in 10 minutes without programming skills. It is possible to create a model, bind tests, run them, and set up automated rules for notifying the teams that participate in the project, as well as register an incident in the ITSM and then visualize progress on solving the problem. Notification setup is very flexible. For example, if a failure lasts 5 minutes, the system sends a notification by email; if it lasts more than 5 minutes, by messenger. If it lasts more than 10 minutes, the system officially registers the incident. All these functions come “out of the box”.
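The duration-based notification rule from the example above can be written down as a few lines of code. This is a sketch, not MONQ configuration syntax: the function name and action labels are hypothetical, and it assumes that actions accumulate as the failure continues (email first, then messenger, then incident registration).

```python
from typing import List


def actions_for(failure_minutes: float) -> List[str]:
    """Actions triggered so far for a failure lasting `failure_minutes`.

    Assumption: escalation is cumulative, so a long failure has already
    gone through the earlier notification steps.
    """
    actions = ["email"]                      # failures of up to 5 minutes
    if failure_minutes > 5:
        actions.append("messenger")          # more than 5 minutes
    if failure_minutes > 10:
        actions.append("register_incident")  # more than 10 minutes
    return actions


short_failure = actions_for(3)
medium_failure = actions_for(7)
long_failure = actions_for(12)
```

A 3-minute failure produces only an email, a 7-minute failure adds a messenger notification, and a 12-minute failure also registers an incident.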
What results have been achieved?
1. The number of useless notifications has been reduced by 64% thanks to complex processing logic.
2. The general state of IT and the possible impact of a failure on the stability of services have become clear. It has become simple to prioritize actions and react to those that are important.
Issue #4. Manual IT Incident Resolution
In each IT unit, the state of the systems was monitored by engineers who manually sorted events and made decisions about their criticality. The client had a lot of information systems written by different developers in different languages. Each information system often had its own monitoring, which generated a huge stream of events.
Incident management had to be automated. The client wanted its engineers to be informed automatically about deviations in working processes and about emergencies. It was also important to register incidents so that they could be resolved manually or automatically. The system for automated incident resolution had to provide a new level of visualization, centrally manage data from Zabbix and Jenkins, and create incidents and alerts.
The client’s products differed in scale and maturity. As a result, IT support was very complex. Some of the products already had monitoring systems connected, others did not. On top of this, system administrators resisted the change and wanted to use only their own closed environments. MONQ AIOps helped to standardize the monitoring processes, while operation itself was fully handed over to local teams. Effective collaboration has finally emerged between related teams.
What AIOps functionality has been used to solve the issue?
1. Automated incident management. Auto-escalation of events was set up. All incidents are automatically registered in ITSM. Teams get notifications about IT issues and tips on resolving problems.
2. Automation with scripts. Routine tasks, such as restarting a server, were automated with scripts. The automation works through automatic script execution, with built-in support for Bash and REST.
3. A standard was introduced. The client has connected 26 Zabbix servers to MONQ.
4. One screen for all systems. Almost all major services were connected to MONQ. The system processes up to 30,000 events and triggers per day.
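As an illustration of script-based automation, the sketch below maps a trigger name to a shell command and runs it through Bash. Everything in it is hypothetical: the trigger names, the `RUNBOOK` table, and the harmless `echo` commands standing in for real actions such as restarting a server. It is not how MONQ stores or executes its scripts.

```python
import subprocess
from typing import Dict, Optional

# Hypothetical runbook: trigger name -> shell command.
# The echo commands are placeholders for real remediation scripts.
RUNBOOK: Dict[str, str] = {
    "service-unresponsive": "echo restarting service",
    "disk-almost-full": "echo rotating logs",
}


def remediate(trigger: str, dry_run: bool = True) -> Optional[str]:
    """Look up the script for a trigger and optionally run it.

    Returns the command (dry run), its output (real run), or None when
    no automation exists and the incident is left to an engineer.
    """
    command = RUNBOOK.get(trigger)
    if command is None:
        return None
    if dry_run:
        return command
    completed = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout.strip()


planned = remediate("disk-almost-full")
unknown = remediate("certificate-expired")
```

Defaulting to a dry run is a deliberate choice in this sketch: it lets a team review which command would run for a given trigger before allowing automatic execution.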
What results have been achieved?
1. Incidents are being resolved automatically. They are automatically recorded and assigned to the responsible teams. Typical routine troubleshooting operations are automated.
2. Incident assignment is 10 times faster. It now takes a couple of minutes instead of the previous 25 minutes.
Issue #5. Contractors Overstate SLA
The client had service support contracts, but it was unprofitable for contractors to register all incidents (for instance, those occurring at night), because doing so would spoil their KPIs by lowering the reported SLA. The client had no tool to control this and wanted all failures to be recorded, along with detailed reports on system performance.
What has been done?
1. Automatic registration of all incidents. MONQ made it possible to specify failure response times in the system maintenance terms and to calculate metrics such as unavailability automatically. All incidents are immediately registered and transferred to a responsible contractor or subcontractor.
2. SLA reports are built automatically. Reports on the availability of each service are generated automatically. The Situation Center assigns tags to events, sees an objective picture of the calculations, and performs the analysis. Reports can be exported in .xls and .pdf formats.
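The availability figure in such a report follows directly from the recorded outages. Here is a minimal sketch of that calculation. The `Outage` type, the service names, the 30-day period, and the 99.2% target are all made up for illustration, and the code assumes that outages of the same service do not overlap.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outage:
    service: str
    start_minute: int  # minutes from the start of the reporting period
    end_minute: int


def availability_report(outages: List[Outage], services: List[str],
                        period_minutes: int) -> Dict[str, float]:
    """Availability per service in percent.

    Assumption: outages of the same service do not overlap.
    """
    downtime = {service: 0 for service in services}
    for outage in outages:
        downtime[outage.service] += outage.end_minute - outage.start_minute
    return {
        service: round(100.0 * (period_minutes - minutes) / period_minutes, 2)
        for service, minutes in downtime.items()
    }


PERIOD = 30 * 24 * 60  # hypothetical 30-day reporting period, in minutes
outages = [
    Outage("portal", 100, 160),      # 60 minutes
    Outage("portal", 5000, 5156),    # 156 minutes, e.g. a night-time failure
    Outage("payments", 900, 1332),   # 432 minutes
]
report = availability_report(outages, ["portal", "payments", "registry"], PERIOD)

SLA_TARGET = 99.2  # hypothetical contractual target, in percent
breaches = [service for service, value in report.items() if value < SLA_TARGET]
```

Because every outage is registered automatically, a night-time failure counts toward the result in the same way as a daytime one, and that completeness is what makes the report usable as a control tool.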
As a result, the client has received an objective tool for controlling contractors and has stopped paying contractors for SLAs they did not meet.
Is it possible for your company?
Yes. You can achieve even better results. Want to try AIOps? Just write to us.