Artificial Intelligence in IT Operations (AIOps)

Artificial intelligence in IT operations is a huge subject, and one with a lot of potential to disrupt the industry.

We saw many examples and many use cases of AIOps, and we saw how AIOps can reduce the cost and improve the efficiency of various services. It’s clear that the industry is moving towards greater and greater adoption of data science and artificial intelligence in IT operations.

What is AIOps? What is operational analytics?

Techopedia defines operational analytics as the application of business analytics to operations. It is as simple as that.

This means that tools and methods from domains such as data mining are applied to data extracted from operations in order to gain insights and optimize decision making.

AIOps stands for artificial intelligence in IT operations.

It refers to the use of data science and AI to analyze big data from various IT and business operations tools.

The goal is to increase the speed of delivery of the various services, to improve the efficiency of IT services and, in general, to provide a superior user experience.

It’s clear that, through these, AIOps enables teams to move away from siloed operations.

It enables the generation of insight, which can be communicated to the stakeholders and it can help drive automation and collaboration within an organization.

So why should someone, such as an executive or a manager, care about AIOps?

Well, if you are the kind of business that has a large infrastructure and depends on the cloud for its day-to-day operations, then you probably know that downtime is costly and that the service can get slow at times, which increases the cost even further.

Servers and cloud infrastructures need proactive management.

However, the complexity of doing this is too high.

You need to hire teams of people who can always be there and deal with every alert, and sometimes they make mistakes.

The premise of AIOps is that many of those issues can be solved through automation.

And there are many use cases for AIOps, some being faster root-cause analysis, predictive analytics, noise reduction, and intelligent automation.

We’re going to see some of these in this presentation.

Before we go on to the use cases, let’s take a moment to look at this research, which surveyed more than 100 IT professionals.

The IT professionals were asked what they believe about AIOps: where it could help and what challenges they face. Of those surveyed, over 70% identified alert correlation and proactive issue detection as the two biggest challenges they face.

AIOps can also help reduce the noise.

The respondents were also asked how AIOps and machine learning can help increase automation across their toolchain.

Most of them said that AIOps can help by providing faster and more accurate root-cause analysis.

They also mentioned that it can help them automate the analysis of an event.

This is directly related to the first point, and obviously this can all help reduce alert noise.

This is related to alert fatigue, which refers to a case where the system generates so many alerts that humans find it difficult to handle all of them.
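To make the idea of noise reduction concrete, here is a minimal Python sketch. It is not taken from the survey or from any specific AIOps product. It assumes each alert is a dictionary with hypothetical `host`, `check`, and `ts` (timestamp in seconds) fields, and it collapses repeated alerts for the same host and check that arrive close together into a single incident.

```python
def group_alerts(alerts, window_seconds=300):
    """Collapse repeated alerts for the same (host, check) into incidents.

    An alert joins an existing incident when it arrives no more than
    window_seconds after that incident's most recent alert.
    """
    incidents = []
    latest = {}  # (host, check) -> index of the most recent incident for that key
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["host"], alert["check"])
        idx = latest.get(key)
        if idx is not None and alert["ts"] - incidents[idx]["last_ts"] <= window_seconds:
            # Same problem still firing: count it instead of raising a new incident.
            incidents[idx]["count"] += 1
            incidents[idx]["last_ts"] = alert["ts"]
        else:
            latest[key] = len(incidents)
            incidents.append({
                "host": alert["host"],
                "check": alert["check"],
                "first_ts": alert["ts"],
                "last_ts": alert["ts"],
                "count": 1,
            })
    return incidents


# Hypothetical raw alerts (field names and values are illustrative only).
alerts = [
    {"host": "web-1", "check": "cpu", "ts": 0},
    {"host": "db-1", "check": "disk", "ts": 30},
    {"host": "web-1", "check": "cpu", "ts": 60},
    {"host": "web-1", "check": "cpu", "ts": 120},
    {"host": "web-1", "check": "cpu", "ts": 1000},
]

for incident in group_alerts(alerts):
    print(incident)
```

A real platform would correlate on far richer signals than an exact host and check match, but even this simple grouping shows how the number of items a human has to look at can shrink.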

Let’s check out the basic elements of AIOps.

At the bottom of the stack, we have the different data sources.

These can be events such as alerts; metrics that we use to monitor a server, such as its load; tickets, which are active issues that are being investigated; and logs, which are records of activity.

Then, on top of this stack of the data sources, we have real time processing, rules and patterns, and domain algorithms.

And these are materialized through the use of machine learning and artificial intelligence.

When we use machine learning and artificial intelligence to create algorithms, which sometimes run on rules and sometimes run on pure machine learning such as deep neural networks, we can digest all those different data sources and then automate many of the things that the human teams are doing.

In this image here, produced by Gartner in 2016, we can see a diagram that explains how AIOps is working.

We see that on the outer circle that encapsulates everything else, we have business value.

This is the most important component of AIOps.

It creates business value by improving the quality of the service and reducing costs.

At the second layer, we see three smaller parts of a circle: monitoring, service desk, and automation.

So monitoring, first, is the act of observing what’s happening.

Service desk refers to ticket management and to the engagement between the team, the platform, and the customers.

Automation is what AI and machine learning offer.

There’s another circle in the middle, which is about continuous insights.

All of these, the monitoring, the engagement, the automation, and the insights, are generated by the core, which is based on machine learning and big data.

In terms of adoption, Gartner tells us that by 2019 25% of enterprises will be using AIOps to support two or more major IT operations functions.

So, sooner or later we’re getting there.

AIOps is becoming more and more popular. Why?

Well, because it’s a great idea. It works, it reduces costs and improves quality of the service.

So how can you do operational analytics?

Let’s talk a bit about that.

As a general guide, operational analytics and AIOps do not really describe a single use case.

As I mentioned in the first part, AIOps is about using data science, machine learning, AI, and data mining on data from operations in order to extract insight and automate processes.

This means that there are many tools which can be useful in this effort, including dashboards or various kinds of machine learning models.

There are many applications of AIOps, and I already mentioned some of these, root cause analysis being one of the more popular.

Optimizing the availability of a network is another very popular operation.

Automatic ticket and problem assignment is another. Anomaly detection for cybersecurity reasons is another important use case, as is improving storage management.

Gartner, which has done a lot of work in AIOps, provides us with another very useful chart showing how an organization can excel in AI for IT operations.

So there are four phases, the establishment phase, the reactive phase, the proactive phase, and the expansion phase.

It’s a very intuitive diagram. The first phase, the establishment phase, is about understanding the operations-related challenges that an organization faces. These challenges are then going to be solved in the reactive and the proactive phases.

The difference between the two phases is that the reactive phase is simpler from a technical perspective, whereas the proactive phase is more advanced because it’s based on prediction.

At some point in this whole process you want to move towards prediction.

You want to be able to see problems before they come.

And once you can do this then obviously you want to expand and try to automate as many of your operations as possible.

The root cause analysis

Talking about root cause analysis, here is a very interesting reference.

I’m not going to go into it in much depth, but there are many papers out there.

This one, provided here for your reference, studied the problem of identifying the root of a particular problem.

So we have classification models.

We also have models based on logic.

In my opinion, a huge part of this research deals with graphical models such as Bayesian networks, because the way these models are structured makes it easier to identify causes and effects in a complex system.

So this is an example of a graphical model.

This example is not related to AIOps, this particular graphical model, but it’s presented here because it’s very easy to understand. So graphical models allow us to express causal relationships in a very intuitive way.

I don’t really have to explain much about this image, but you can see that it has different nodes, which represent variables, and what we’re essentially trying to do here is find how all these variables are related to cardiovascular disease.

So we see that stress can be related to smoking and cardiovascular disease.

Income can be related to a sedentary lifestyle.

So imagine that, instead of this diagram, we had a large number of interconnected variables that describe how a complex system, such as a cloud, might fail in some circumstances.

By defining these relationships and then feeding data into this model, we can have the model tell us the probability of the system failing and where the weak points are. If a problem actually takes place, it can also help us understand what seems to have been the main variable, the main factor that caused the problem.
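As a rough illustration of this kind of reasoning, the following Python sketch builds a tiny, made-up Bayesian network with two possible causes (a full disk and a network fault) and one effect (the service being down). All variable names and probabilities are hypothetical and chosen only for the example. The sketch computes the overall probability of failure and, given that a failure happened, which cause is the more likely one.

```python
from itertools import product

# Hypothetical prior probabilities of each root cause.
P_DISK_FULL = 0.05
P_NET_FAULT = 0.02

# Hypothetical conditional probability of the service being down,
# indexed by (disk_full, net_fault).
P_DOWN = {
    (True, True): 0.99,
    (True, False): 0.80,
    (False, True): 0.60,
    (False, False): 0.01,
}


def joint(disk_full, net_fault, down):
    """Joint probability of one full assignment of the three variables."""
    p = P_DISK_FULL if disk_full else 1 - P_DISK_FULL
    p *= P_NET_FAULT if net_fault else 1 - P_NET_FAULT
    p_down = P_DOWN[(disk_full, net_fault)]
    return p * (p_down if down else 1 - p_down)


def prob_down():
    """P(service down), summing over every combination of causes."""
    return sum(joint(d, n, True) for d, n in product([True, False], repeat=2))


def posterior_cause(cause):
    """P(cause is present | service is down), for cause in {'disk', 'net'}."""
    numerator = 0.0
    for d, n in product([True, False], repeat=2):
        if (cause == "disk" and d) or (cause == "net" and n):
            numerator += joint(d, n, True)
    return numerator / prob_down()


print("P(down) =", round(prob_down(), 4))
print("P(disk full | down) =", round(posterior_cause("disk"), 4))
print("P(network fault | down) =", round(posterior_cause("net"), 4))
```

With these invented numbers, a full disk comes out as the more probable explanation for an outage. In a real system the structure and the probabilities would be learned from operational data or supplied by domain experts, and the network would have many more nodes.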

An interesting research paper coming from the area of AIOps is the following one, which dealt with the prediction of network availability.

And the important takeaway from the paper is that it is possible to create machine learning classifiers that have a very high degree of accuracy in predicting whether machines are going to fail before the failure happens. This comes back to the point I mentioned earlier, that predictive analytics can be very powerful, because if you know that some sensor is about to fail, then you can fix it before the problem occurs.
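The paper’s own models are not reproduced here. As a hedged sketch of what a failure classifier can look like, the Python code below trains a small logistic regression by gradient descent on a handful of invented data points. The two features (a scaled temperature and a scaled error rate), the labels, and the function names are all assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, learning_rate=0.5, epochs=2000):
    """Fit logistic regression weights with full-batch gradient descent."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            # Prediction error for this sample under the current model.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def failure_probability(model, x):
    """Estimated probability that a machine with features x will fail."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Invented training data: (scaled temperature, scaled error rate), both in [0, 1].
samples = [
    (0.20, 0.10), (0.30, 0.20), (0.10, 0.30), (0.25, 0.15),  # stayed healthy
    (0.80, 0.90), (0.70, 0.80), (0.90, 0.70), (0.85, 0.75),  # later failed
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

model = train_logistic(samples, labels)
print("hot, error-prone machine:", failure_probability(model, (0.9, 0.9)))
print("cool, quiet machine:", failure_probability(model, (0.1, 0.1)))
```

Real failure prediction uses far more data, more features, and proper evaluation on held-out machines. The point of the sketch is only that the output is a probability that can trigger maintenance before the failure happens.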

Another interesting area for AIOps is automated problem resolution.

So this refers to the automated handling of problems, such as prioritizing tickets or categorizing them or assigning them to the right person who can look into this problem further.

Again, the research in this area comes largely from academia, and there are algorithms that can achieve over 90% accuracy in assigning tickets to the correct person.
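The algorithms from that research are not shown here. To give a feel for the task, the following Python sketch uses a simple multinomial naive Bayes text classifier, a technique chosen for this example rather than one named by the research, to route a ticket to a team based on the words in its description. The tickets, team names, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(tickets):
    """Count words per team from (text, team) pairs."""
    team_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, team in tickets:
        team_counts[team] += 1
        for word in tokenize(text):
            word_counts[team][word] += 1
            vocab.add(word)
    return team_counts, word_counts, vocab


def assign(model, text):
    """Return the team with the highest naive Bayes log score for the text."""
    team_counts, word_counts, vocab = model
    total_tickets = sum(team_counts.values())
    best_team, best_score = None, float("-inf")
    for team, count in team_counts.items():
        score = math.log(count / total_tickets)
        # Laplace smoothing: add one to every word count.
        denominator = sum(word_counts[team].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[team][word] + 1) / denominator)
        if score > best_score:
            best_team, best_score = team, score
    return best_team


# Hypothetical historical tickets and the teams that resolved them.
history = [
    ("database connection timeout on orders db", "database"),
    ("slow query on replica database", "database"),
    ("vpn tunnel down for remote office", "network"),
    ("packet loss on core switch", "network"),
    ("disk full on backup volume", "storage"),
    ("storage array reports degraded disk", "storage"),
]

model = train(history)
print(assign(model, "replica database timeout"))
print(assign(model, "packet loss on vpn"))
```

The same idea extends from teams to individual assignees or to priority levels, as long as there is enough labeled ticket history to learn from.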

Finally, AIOps also includes cybersecurity.

It’s one very important component of IT operations. Here, the methods come from the family of anomaly detection, also called outlier detection, which is extremely relevant.

Anomaly detection can help us understand whether there’s something unusual about a time series or some other event, such as unusual traffic due to hackers trying to break into a server. There are also many commercial platforms, such as IBM Watson, that are now offering this service.
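One of the simplest ways to spot something unusual in a time series is a rolling z-score: compare each new value with the mean and standard deviation of the values just before it. The Python sketch below illustrates that idea under assumed parameters (a window of 10 points and a threshold of 3 standard deviations). It does not represent how any commercial platform works, and the traffic numbers are invented.

```python
from statistics import mean, stdev


def find_anomalies(series, window=10, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is unusual.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Invented requests-per-minute series with one sudden spike.
traffic = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 500, 100, 102]
print(find_anomalies(traffic))
```

Note that, right after a spike, the spike itself inflates the standard deviation of the window, so nearby points are less likely to be flagged. Production systems usually handle this with more robust statistics or with learned models of normal behavior.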