Should we use artificial intelligence to fix bad configuration data?
We’re all lazy, aren’t we?
This blog article about configuration data and artificial intelligence is written by Dimitris Finas, Technical Director – France, at Sweagle. This blog looks at the ways artificial intelligence and automation can support the management of configuration data change within a DevOps application estate.
As human beings, we are all inherently lazy. We try to postpone or avoid tasks because they take too much time, seem too complex, or involve searching out too much information. This is also true in IT and it’s a common factor between Dev and Ops. Both teams are always seeking solutions in order to make their daily tasks simpler or easier.
Sometimes, this seems like the quest for the Holy Grail. DevOps issues can be so complex, yet people expect them to be solved with the wave of a magic wand.
Current state of incident resolution & error detection
Let’s consider the current state of error detection and incident resolution. In most organizations, this process is still manual and therefore painful. Errors are often only caught after they occur – when the environment goes down or becomes unstable. In order to rectify the issue, various subject matter experts in Dev or Ops teams have to get involved. Engineers have to analyze what’s changed whilst trying to bring service back. Finding the root cause (especially in a state of panic) isn’t always straightforward.
So even though monitoring tools detect many errors automatically, few organizations can yet detect errors before they occur.
Pros and cons of AI validation rules for configuration data
We need to move to a position where DevOps can validate all the config data before each deployment. To achieve this, we have to employ common good practice, such as running validation rules. This involves creating scripts that check the configuration data is correct before it is deployed.
For example, a basic rule you’d want to implement is to enforce encryption of all passwords. Or you may want to check that you don’t use a connection string for a test environment in production. Or you may simply want to check that all required parameters are present and correct.
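The three rules just described can be sketched as small scripted checks. This is a minimal illustration only, not a description of any product's rule engine: the flat key/value layout, the key names (db.url, db.password, app.memory_mb), the ENC(...) wrapper for encrypted values, and the idea that a test connection string contains the word "test" are all assumptions made for the example.

```python
import re
from typing import Callable

# Hypothetical flat configuration: key -> value. Key names are illustrative.
Config = dict[str, str]
Rule = Callable[[Config, str], list[str]]

# Assumed list of mandatory parameters for this example.
REQUIRED_KEYS = ("db.url", "db.password", "app.memory_mb")


def passwords_are_encrypted(config: Config, environment: str) -> list[str]:
    """Flag password values that lack the assumed 'ENC(...)' wrapper."""
    return [
        f"{key}: password is not encrypted"
        for key, value in config.items()
        if "password" in key.lower() and not re.fullmatch(r"ENC\(.+\)", value)
    ]


def no_test_connection_in_production(config: Config, environment: str) -> list[str]:
    """Flag connection strings that look like test ones when targeting production."""
    if environment != "production":
        return []
    return [
        f"{key}: test connection string used in production"
        for key, value in config.items()
        if key.lower().endswith(".url") and "test" in value.lower()
    ]


def required_keys_present(config: Config, environment: str) -> list[str]:
    """Flag required parameters that are missing or empty."""
    return [
        f"{key}: required parameter is missing or empty"
        for key in REQUIRED_KEYS
        if not config.get(key, "").strip()
    ]


RULES: list[Rule] = [
    passwords_are_encrypted,
    no_test_connection_in_production,
    required_keys_present,
]


def validate(config: Config, environment: str) -> list[str]:
    """Run every rule and collect the error messages."""
    errors: list[str] = []
    for rule in RULES:
        errors.extend(rule(config, environment))
    return errors
```

A pipeline step could call validate(config, "production") and block the deployment whenever the returned list is not empty.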
Additional benefits a rule framework can provide are:
The drawbacks of scripted rules are:
- You need to know the issue in order to test it
- You must maintain a library of rules.
Automatic detection of configuration data errors & root cause
This is definitely the dream of every Ops person and it is certainly the target I want to reach with every customer I work with. We can achieve it through machine learning capabilities that identify the possible root causes of many configuration data issues. If your organization relies on a deployment pipeline, extend it to detect config data errors before deployment. You can even achieve this without creating any rules beforehand.
Let me simplify the approach. Based on the status of each deployment you made, artificial intelligence (AI) will correlate what bad deployments have in common. You can get the status from user feedback or through integration with incident management systems. AI will automatically discard irrelevant information, like a date field whose value changes with every deployment. The more deployments you make, the more it will learn and the more accurate the results will be.
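To make the idea concrete, here is a deliberately naive sketch of that correlation. It is not the machine learning algorithm of any product: it drops parameters whose value differs in every deployment, then scores each remaining key/value pair by how much more often deployments carrying it failed compared with deployments without it. The Deployment record and its fields are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical record: the config that was deployed and whether it worked."""
    config: dict[str, str]
    ok: bool


def always_changing_keys(deployments: list[Deployment]) -> set[str]:
    """Keys whose value is different in every deployment (e.g. a build date)."""
    keys = set().union(*(d.config.keys() for d in deployments))
    return {
        key
        for key in keys
        if len({d.config.get(key) for d in deployments}) == len(deployments)
    }


def rank_suspects(deployments: list[Deployment]) -> list[tuple[float, tuple[str, str]]]:
    """Rank (key, value) pairs by failure rate with the pair minus without it."""
    noisy = always_changing_keys(deployments)
    seen: Counter = Counter()
    failed: Counter = Counter()
    for d in deployments:
        for key, value in d.config.items():
            if key in noisy:
                continue  # carries no signal, skip it
            seen[(key, value)] += 1
            if not d.ok:
                failed[(key, value)] += 1

    total = len(deployments)
    total_failed = sum(1 for d in deployments if not d.ok)
    scores = []
    for pair, count in seen.items():
        rate_with = failed[pair] / count
        others = total - count
        # A pair present in every deployment cannot explain anything: score 0.
        rate_without = (total_failed - failed[pair]) / others if others else rate_with
        scores.append((rate_with - rate_without, pair))
    scores.sort(key=lambda item: item[0], reverse=True)
    return scores
```

The first entries of the returned list are the settings most strongly associated with failed deployments. With more deployments in the history, the failure rates behind each score rest on more observations.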
An example of AI enabled deployments
Here’s an example. Imagine you deploy your application and encounter an issue because you haven’t set enough memory. You correct the error. You deploy several new releases with no issues and 6 months later, you deploy a new version and encounter problems again.
When checking what changed since the last release, you identify thousands of parameter updates and can’t clearly match what’s being impacted or not.
Now imagine you had that magic wand where you could click a button to perform advanced root cause analysis. This would launch an analysis of all failed and successful deployments and identify, in a few seconds, the correlation between the last deployment and other failed ones. It would raise an alert on the memory setting that’s below an identified threshold, with a confidence indicator on the result, and might also give indications of other possible root causes.
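As a rough illustration of the memory alert, the sketch below learns a threshold for a single numeric parameter from past deployments and reports it with a simple confidence value. Everything here is an assumption for the example: the history is a list of (memory value, deployment succeeded) pairs, the rule tested is "below the threshold means failure", and confidence is just the share of past deployments that the rule classifies correctly. A real analysis would look at many parameters at once.

```python
from typing import Optional


def best_threshold(history: list[tuple[float, bool]]) -> tuple[float, float]:
    """Return (threshold, confidence) for the rule 'value below threshold => failure'."""
    if not history:
        raise ValueError("history must contain at least one deployment")
    values = sorted({value for value, _ in history})
    best = (values[0], 0.0)
    # Candidate thresholds are the midpoints between consecutive observed values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        # A prediction is correct when 'below threshold' coincides with 'not ok'.
        correct = sum((value < threshold) != ok for value, ok in history)
        confidence = correct / len(history)
        if confidence > best[1]:
            best = (threshold, confidence)
    return best


def memory_alert(value: float, history: list[tuple[float, bool]]) -> Optional[str]:
    """Return an alert message if the new value falls below the learned threshold."""
    threshold, confidence = best_threshold(history)
    if value < threshold:
        return (
            f"memory {value} is below the learned threshold {threshold} "
            f"(confidence {confidence:.0%})"
        )
    return None
```

The message only tells the engineer where to look first. A low confidence value is a hint that this parameter alone does not explain the failures.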
What is the AI telling us? It’s targeting where you should look first.
Of course, real-world problems are more complex: many different teams are involved, subject matter experts move on, and changes and knowledge are not always communicated. That’s exactly why you should lean on machine storage and calculation capacity to provide valuable inputs.
At some point, we can even move from root cause analysis algorithms to predictive error warnings. The value will come from an improved Mean Time To Repair (MTTR) and from avoiding some problems altogether.
Is AI just a new problem in the handling of configuration data?
When I present new machine learning features to customers, some common objections I hear are:
Both objections are wrong. I’m sorry but AI is already here! Machine learning features can work as soon as you have deployed your applications at least 20 times. This is a ridiculously small number if you count the number of times applications are deployed (in test, UAT, pre-production and so on) even before going live.
There’s definitely some fear of AI. Fear of being useless if the job is done by a machine. Then the fear is hidden behind other misplaced reasons.
In more than 20 years of working in IT, I see only one rule that’s always true:
The more you automate, the more work you have.
This is simply because the more efficient IT is, the more people will rely on it. And so your scope of work will increase.
In addition, I want to reiterate that current AIOps features are more about assisting humans than replacing human work. AI will augment Ops people by expanding their calculation and correlation capabilities beyond what humans can do alone.
So, we can use data – from way back in the past – to avoid repeating errors and to predict future issues.
All these machine calculations result in either a report of possible root causes ranked by priority, or a quality/risk index published in a dashboard. Both results are human-readable information that an Ops engineer will use to decide on further action.
In the end, all this is still under the control and management of human beings. So AI is not a problem, it is part of the solution. It allows us to handle more deployments faster than before in a more automated, controlled and standardized way.
In my opinion, the only valid question is: how can we take advantage of this AIOps automation right now?