Machine Learning and AI in the Datacenter and How It Will Affect You

Recently, Mirantis’ own Nick Chase presented on challenges to the datacenter in a CNCF webinar called Machine Learning and AI in the Datacenter and How It Will Affect You. On Tuesday, May 8, he will present a second CNCF webinar on datacenter-related issues, Enhancing control over Kubernetes with Spinnaker as Continuous Delivery, so we thought we’d bring you an excerpt from that earlier talk.

Now let’s talk about some of the difficulties in running an efficient datacenter and how machine learning and AI techniques can help. Remember, I’m not saying you need to jump in and do all of these things from the start, I’m just trying to show you that there are opportunities to improve things with AI right under your nose, if you want to go that way.

Configuration

So starting at the beginning of the process, let’s talk about configuration.  Self service is supposed to be an advantage, and it is for administrators because they don’t have to have the burden of providing resources for everyone who needs them. Unfortunately, it pushes that burden down to the developer, who might not be familiar with what’s available, or specifically how to take best advantage of it.  I’ve been a developer for years, and I know I don’t want me making decisions about the best way to set up a datacenter.

And the situation is even worse with microservices, because even once you’ve got the system itself set up, there are so many theoretical choices for developers that there are tons of opportunities to have problems.

And of course with dynamic environments, we have lots of opportunity for configuration drift.  This is either going to get worse with immutable architecture because it’s all baked into instances that are going to break when things change, or it’s going to get better because people know it’s baked into instances that are going to break when things change.  Personally, I think it’s going to be worse unless we can get better processes in place.

Adding intelligence

You can think of a lot of these things like playing a game of “Telephone”.  You remember that game where you whisper something to someone and they whisper it to someone else and by the end of the line you don’t recognize what comes out?  That’s funny when it’s a game. It’s not so funny when your entire infrastructure is in metaphorical flames.

So how can we add intelligence to this process?  Well, for one thing, a machine learning routine can look at previous deployments and potentially figure out, well, this is similar to this app, which did best when it was deployed in this way, so I’m going to recommend that, or even just do it. Even after it’s been deployed, you might use reinforcement learning to get the system to retroactively improve performance or resource usage.
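To make the "this is similar to this app" idea concrete, here is a minimal sketch of a nearest-neighbor recommendation. Everything in it is an assumption for illustration: the feature set (requests per second, memory in GB, number of services), the configuration names, and the history itself are hypothetical, and a real system would learn from far richer data.

```python
import math

# Hypothetical history: (features, configuration that performed best).
# Features are invented for illustration: (requests/sec, memory GB, service count).
PAST_DEPLOYMENTS = [
    ((50.0, 2.0, 3.0), "small-2-replicas"),
    ((500.0, 8.0, 12.0), "medium-6-replicas"),
    ((5000.0, 32.0, 40.0), "large-20-replicas"),
]


def _scale(features, maxima):
    """Divide each feature by its largest historical value so no feature dominates."""
    return tuple(value / maximum for value, maximum in zip(features, maxima))


def recommend_config(new_app, history=PAST_DEPLOYMENTS):
    """Return the configuration of the most similar past deployment (1-nearest neighbor)."""
    maxima = [max(features[i] for features, _ in history) for i in range(len(new_app))]
    target = _scale(new_app, maxima)
    _, best_config = min(
        history,
        key=lambda item: math.dist(_scale(item[0], maxima), target),
    )
    return best_config


if __name__ == "__main__":
    print(recommend_config((450.0, 6.0, 10.0)))
```

The system could either surface that recommendation to the developer or, as described above, just apply it.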

Performance Optimization

Once you’ve got things configured you do want to get that optimization right, which can be tough because in this distributed dynamic environment, there are a ton of factors to consider, and some of them are probably not even visible to the developer, much less controllable by him or her.  

And even if you do get a perfectly optimized environment or deployment, once things change, like a cluster goes down, or there’s a network slowdown somewhere, your optimization goes out the window.

So by using machine learning, we can create a setup that takes into account multiple factors and the way they all interrelate in a way that humans are going to have a hard time matching.  

The system can also predict future load and proactively take steps to scale up. So for example, if your company ran a Super Bowl ad, the system might take several factors into account, realize that a slight increase during the commercial wasn’t just a blip, and scale up before a huge increase hit when the commercial was over.

Or your system might expect more people to be watching the game in Philadelphia and Boston based on patterns picked up beforehand, and redirect resources accordingly.
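As a rough sketch of the "scale up before the spike" idea, the snippet below fits a straight line to recent load samples and sizes the deployment for the predicted load rather than the current one. A linear trend is only a stand-in for a real forecasting model, and the function names, sample values, and per-replica capacity are assumptions for illustration.

```python
import math


def forecast_load(samples, steps_ahead):
    """Fit a least-squares line to equally spaced samples and extrapolate it."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The last sample sits at x = n - 1, so look steps_ahead beyond that.
    return intercept + slope * (n - 1 + steps_ahead)


def replicas_needed(samples, steps_ahead, capacity_per_replica, min_replicas=1):
    """Size the deployment for the predicted load instead of the current load."""
    predicted = max(forecast_load(samples, steps_ahead), 0.0)
    return max(min_replicas, math.ceil(predicted / capacity_per_replica))


if __name__ == "__main__":
    recent_requests_per_second = [100, 120, 140, 160]  # hypothetical samples
    print(replicas_needed(recent_requests_per_second, steps_ahead=3,
                          capacity_per_replica=50))
```

A production system would take more signals into account (time of day, scheduled events, region), which is exactly where a learned model earns its keep over a simple trend line.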

Cost optimization

And sometimes optimization isn’t even about performance, but cost.  With the rise of public cloud and a “pay as you go” strategy, making sure you are not paying for something you don’t need is really important.  To give you an example, I myself recently got a bill from AWS for $156 for load balancers I didn’t even realize I’d spun up when testing something the previous month.  So there are way more moving pieces than most people can keep track of.

And even if you do keep track of them, the prices are always changing.  Fortunately, the unit prices generally go down rather than up as all of these companies fight to get you locked into their own cloud, but the fact remains that there are often times when you’re not on the most cost effective resources, even though you were yesterday.

And we can’t even overlook the obvious.  With all of this data flying around, and there is more every day, we are running into HUGE storage costs.  There’s one auto company that generates so much data on a test drive that they can’t transmit it to a central location fast enough, and they have to send it to hard drives in a chase van.

So in terms of how machine learning can help, of course one thing that it’s very good at is keeping track of all of these multiple parameters, so it can figure out if you’re better off on this system or that one based on usage in ways that maybe aren’t obvious.  For example, maybe compute is cheaper at provider A, but you don’t use a lot of compute, you use a lot of storage and you transmit it, so you’re better off at provider B.

And I’m not saying that you couldn’t code something up to figure all this out without machine learning, of course, I’m just saying that it lends itself to that, especially in these changing environments.
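For a fixed set of prices, the provider A versus provider B comparison really is just arithmetic, as this small sketch shows. The price sheets and usage figures are made up for illustration and do not reflect any real provider's pricing; the hard part in practice is that both the prices and your usage keep changing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSheet:
    compute_per_hour: float
    storage_per_gb_month: float
    egress_per_gb: float


@dataclass(frozen=True)
class Usage:
    compute_hours: float
    storage_gb: float
    egress_gb: float


def monthly_cost(prices, usage):
    """Total monthly bill for one usage profile under one price sheet."""
    return (
        prices.compute_per_hour * usage.compute_hours
        + prices.storage_per_gb_month * usage.storage_gb
        + prices.egress_per_gb * usage.egress_gb
    )


def cheapest_provider(price_sheets, usage):
    """Return the name of the provider with the lowest bill for this usage."""
    return min(price_sheets, key=lambda name: monthly_cost(price_sheets[name], usage))


# Hypothetical prices: A has cheaper compute, B has cheaper storage and transfer.
PROVIDERS = {
    "A": PriceSheet(compute_per_hour=0.04, storage_per_gb_month=0.025, egress_per_gb=0.09),
    "B": PriceSheet(compute_per_hour=0.06, storage_per_gb_month=0.020, egress_per_gb=0.05),
}

if __name__ == "__main__":
    storage_heavy = Usage(compute_hours=100, storage_gb=10_000, egress_gb=5_000)
    print(cheapest_provider(PROVIDERS, storage_heavy))
```

With the storage-heavy profile shown, provider B wins even though provider A has the cheaper compute, which is the situation described above.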

Also, as far as the storage issue goes, you can proactively decide what data to keep and what to throw away.  Now, I’m going to be honest with you, that makes me incredibly nervous, because on a personal level I’m very much a data hoarder, but storage experts assure me that this is absolutely necessary or we’re just going to drown in all of the data that we’re generating on a daily basis.

Fault detection

OK, so you’ve got everything configured, you’ve got it optimized, and you’re not paying more for it than you need to.  But things can still break. Hardware goes down. Software configurations drift (we talked about that earlier). Sometimes you have data corruption going on, and you might not even know.  You might have silent data corruption caused by bad firmware, or even loud noises or cosmic radiation. (Yes, that really is a thing.)

Silent data corruption doesn’t get noticed by the operating system, and you might not find it until much, much too late.  Remember that broken telephone analogy: this could be a huge issue before you even know it.

You might also need to get a head start on running out of storage space or memory or other resources.

So of course you can do auto scaling without machine learning, but as we’ve been saying, by adding it in you have the ability to do more sophisticated versions of it.

You can also let the system detect patterns that might indicate a disk is going to go bad in the near future, or other hardware issues are going to pop up, so you can move data or resources and replace that hardware before it becomes a problem. Machine learning is also good for recognizing non-patterns, or anomalies that indicate that something isn’t right, whether it’s again, hardware that’s going bad, or software or data that’s been corrupted.
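One simple way to flag readings that don't fit the pattern is a robust outlier test. The sketch below uses the modified z-score (based on the median and the median absolute deviation) as a statistical stand-in for a trained model; the metric (daily error counts for a disk) and the sample values are hypothetical.

```python
import statistics


def find_anomalies(readings, threshold=3.5):
    """Return the indexes of readings whose modified z-score exceeds the threshold."""
    median = statistics.median(readings)
    deviations = [abs(reading - median) for reading in readings]
    mad = statistics.median(deviations)  # median absolute deviation
    if mad == 0:
        # More than half the readings equal the median, so any deviation stands out.
        return [i for i, reading in enumerate(readings) if reading != median]
    return [
        i
        for i, reading in enumerate(readings)
        if abs(0.6745 * (reading - median) / mad) > threshold
    ]


if __name__ == "__main__":
    daily_error_counts = [5, 6, 5, 7, 6, 5, 6, 48, 6, 5]  # hypothetical disk metric
    print(find_anomalies(daily_error_counts))
```

The median-based score is used here because a single bad reading barely moves the median, whereas it can drag a mean and standard deviation far enough to hide itself.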

And of course there are security issues, which is somewhere that machine learning is already in play.  I think we’re all pretty familiar with Denial of Service attacks, but potentially even more scary are Advanced Persistent Threats, where the attacker gets into your system ahead of time and just sits there, undetected, for months or years, just stealing data or doing all kinds of untold damage. And then there are inside jobs, where someone who should have access is doing things that they probably shouldn’t be doing, even if it’s all technically allowed. So machine learning gives us the ability to do anomaly detection to find those APTs.  You might also use it to find situations where what somebody’s doing is just not … right.

Pattern recognition can help you detect when a DDoS attack is heading your way so you can take steps to prevent issues, insofar as that’s possible.  It might also help you locate the source of the attack and block it. Obviously that’s tougher with a distributed attack such as the Mirai botnet, where hundreds of thousands of IoT devices were taken over and used for this kind of thing, but that’s the whole point.  A machine learning routine might quickly figure out that that’s what’s happening and help you filter out that traffic.
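As a very rough illustration of spotting a traffic surge, the sketch below keeps an exponentially weighted moving average of the request rate and flags samples that jump far above it. This is a simple baseline heuristic, not a trained model, and the class name, parameters, and traffic numbers are assumptions for illustration.

```python
class TrafficMonitor:
    """Track a moving baseline of request rate and flag sudden surges."""

    def __init__(self, alpha=0.2, surge_factor=3.0, warmup=5):
        self.alpha = alpha                # weight given to the newest sample
        self.surge_factor = surge_factor  # how far above baseline counts as a surge
        self.warmup = warmup              # samples to observe before flagging anything
        self.baseline = None
        self.seen = 0

    def observe(self, requests_per_second):
        """Record one sample; return True if it looks like a surge."""
        self.seen += 1
        if self.baseline is None:
            self.baseline = float(requests_per_second)
            return False
        is_surge = (
            self.seen > self.warmup
            and requests_per_second > self.surge_factor * self.baseline
        )
        if not is_surge:
            # Learn only from normal traffic so an attack does not raise the baseline.
            self.baseline += self.alpha * (requests_per_second - self.baseline)
        return is_surge


if __name__ == "__main__":
    monitor = TrafficMonitor()
    for rate in [100, 110, 90, 105, 95, 100, 1000, 102]:  # hypothetical samples
        print(rate, monitor.observe(rate))
```

A real detector would also look at who is sending the traffic and what it looks like, since a distributed attack spreads the load across many sources that each appear unremarkable on their own.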

You can view the entire talk here:

If you’re interested in the idea of adding control to your datacenter, check out our upcoming webinar with the Cloud Native Computing Foundation, Enhancing control over Kubernetes with Spinnaker as Continuous Delivery, on May 8.
