Is your own Equifax crisis hiding in your infrastructure?
We may never know why the disastrous security flaw that allowed hackers to steal the personal information of more than 145 million people was still present on Equifax systems two months after it was first disclosed. What anybody who works in enterprise IT does know is that even when you're aware of the security updates that need to be made, once your infrastructure reaches a certain degree of complexity, knowing can be easier than doing.
Given the fallout from the Equifax breach, though, that's hard to defend as an excuse. Fortunately, it doesn't have to be that way; the same technology that makes it possible to create massively complex systems also makes it possible to keep them updated and secure -- if you do it right. As the Equifax breach showed, there are lots of ways to get things wrong.
What happened at Equifax
The vulnerability that enabled hackers to breach Equifax's security systems wasn't due to anything that engineers at Equifax did, exactly. Instead, it stemmed from a bug that existed in one of the software packages Equifax used to run the website consumers used to file disputes regarding information on their credit reports.
Once hackers were able to breach the web server, it seems, they were able to work their way to related Equifax systems, making a series of intrusions and stealing sensitive personal information for 44% of the population of the United States.
Seems like there was nothing they could do, right?
The reality is that the vulnerability that led to this breach had been identified and disclosed by US-CERT, part of the U.S. Department of Homeland Security, two months before the breaches began. What's more, once the attackers gained entrance to the initial system, they should have had very limited options; because such systems are on the front lines, so to speak, they should always follow the principle of "least privilege", with each system granted only the access it needs to do its job. While we can only speculate, of course, that doesn't seem to have been the case here.
Once the breaches began, it took another two and a half months for the company to notice -- and even then they didn't take the vulnerable system down until the next day, when they saw "further suspicious activity".
At this point it may sound like we're piling on Equifax, but that's not the case. The reality is that the company had a lot of things going against it.
Why keeping large systems safe can be difficult
While it's tempting to think that Equifax simply ignored the vulnerability, that doesn't actually seem to have been the case. In fact, according to the company, "The particular vulnerability ... was identified and disclosed by U.S. CERT in early March 2017. Equifax's Security organization was aware of this vulnerability at that time, and took efforts to identify and to patch any vulnerable systems in the company's IT infrastructure."
So why did it take so long?
Again, we're only speculating here, but most enterprise systems suffer from the same problem: individuality. While it's great for people, it's not so great when you've got dozens or hundreds or even thousands of servers, and they all need individual care. Manually configuring and documenting the status of individual servers can quickly become unmanageable. What's more, the difficulty of keeping up with what needs to be done sometimes leads operators to cut corners by relaxing security or increasing permissions to get around problems, rather than trying to make everything consistent -- and correct.
Once these systems are up and running, there are so many different logs and alerts and events that it's impossible to simply follow them all without some form of dashboard, and even then, the raw data doesn't necessarily tell you anything. It's no wonder it took so long for Equifax to realize they'd been breached, and that they had to hire a security firm to tell them the extent of the damage.
How Equifax could have prevented it
Equifax's problems seem to fall into three phases: before, during, and after.
If Equifax's security team knew about the vulnerability months before the breach, why wasn't it patched? The answer has two parts.
First, when you've got production systems that large, you can't just go applying patches without testing; you could easily destroy your entire deployment. Instead, engineers or operators need to isolate the problem and the fix, then test the fix before planning to deploy it. Those tests need to be in an environment that is as close to the production environment as possible, and when the fix is deployed, that deployment has to match the way it was done in testing in order to duplicate the results.
Second, once you've determined what the fix is and how to deploy it, you need to actually roll it out to every affected system -- something that can take a significant amount of time and manpower.
Obviously this isn't something that can be easily achieved manually, even with just a handful of servers.
Infrastructure as Code
When we talk about "manually" configuring servers, the truth is that this is increasingly rare these days, as configuration management systems such as Puppet, Ansible, and Salt become more common. These systems enable administrators or operators to specify either the end condition they want, in the case of declarative systems, or the actions that should be taken, in the case of imperative systems. Either way, they wind up with a script that can be treated like program code.
This is important for a number of different reasons. The most obvious is that it enables administrators to easily manage multiple machines with a single set of commands, but that's just the beginning.
When we say that these scripts can be treated like program code, this has a number of different implications:
- They can be checked into version control systems such as Git, making it possible to keep track of the "official" version of the various scripts, and any changes that are made to them.
- They can be incorporated into a Continuous Integration / Continuous Deployment (CI/CD) system that enables them to be tested and deployed automatically (if appropriate) when changes are made, rather than having an administrator manually address each server individually.
- When a fix needs to be made, these scripts can be analyzed to determine what systems are affected, and the fix can be easily integrated, tested, and deployed.
- Fixes can be made not only to servers, but also to security policies, web application frameworks, and other pieces of the security puzzle, enabling manageable virtual patching and rollback.
- Because everything is scripted, the environment can be strictly controlled, ensuring that when a fix is deployed, it's deployed in exactly the way it was deployed for testing.
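As a rough illustration of the point about determining which systems are affected, the sketch below assumes a machine-readable inventory that records the package versions on each host. The host names, the `example-framework` package, and the `affected_hosts` helper are hypothetical, and the version parsing only handles simple dotted numeric versions.

```python
# Hypothetical sketch: find hosts running a version older than the fixed one.

def parse_version(version: str) -> tuple[int, ...]:
    """Turn '2.3.5' into (2, 3, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def affected_hosts(
    inventory: dict[str, dict[str, str]], package: str, fixed_in: str
) -> list[str]:
    """Return hosts where `package` is installed at a version below `fixed_in`."""
    fixed = parse_version(fixed_in)
    return sorted(
        host
        for host, packages in inventory.items()
        if package in packages and parse_version(packages[package]) < fixed
    )
```

Comparing parsed tuples rather than raw strings matters here: as plain text, "2.3.5" sorts after "2.3.32", which would quietly leave an old server off the patch list.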
Scripting these operations can also solve another problem: the tendency to cut corners and loosen security because it's easier than figuring out how to make everything work without opening the system door wide.
Monitoring and pro-active management
Of course, no system is perfect; even if Equifax had managed to implement every patch as soon as it was made available, there's no guarantee they wouldn't get hit with a so-called "zero-day vulnerability" that hadn't yet been disclosed.
To solve this problem, it's important to have a meaningful logging, monitoring, and alerting system that makes it possible to spot problems and anomalies as early as possible.
That means more than just having logs; logs can provide information on specific errors, but won't show you trends. For that you need tools such as Grafana or other time-based reporting tools. In addition, newer technologies such as machine learning can spot anomalies humans might miss.
Finally, the best reporting in the world won't save you if nobody's paying attention. The best monitoring is pro-active, and the same skills that make it possible to watch for trends and predict problems such as hardware failure can make it possible to spot issues such as a large data outflow that shouldn't be happening, and take action immediately, rather than waiting for it to happen again in order to be "sure" of what you're seeing.
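Spotting "a large data outflow that shouldn't be happening" does not necessarily require anything exotic. The sketch below is one simple approach, not a description of any particular monitoring product: it compares each new measurement of outbound traffic against the mean and standard deviation of the preceding few samples. The function name, the window size, the threshold, and the `min_sigma` floor are all assumptions chosen for illustration, and a production system would need to account for things like daily traffic cycles.

```python
# Hypothetical sketch: flag unusually large outbound-traffic samples.
from statistics import mean, stdev


def find_outflow_anomalies(
    samples: list[float],
    window: int = 6,
    threshold: float = 3.0,
    min_sigma: float = 1.0,
) -> list[int]:
    """Return indexes of samples far above the recent baseline.

    Each sample is compared with the `window` samples before it. A sample
    is flagged when it exceeds the baseline mean by more than `threshold`
    standard deviations. `min_sigma` keeps a perfectly flat baseline from
    producing a zero standard deviation.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        sigma = max(stdev(baseline), min_sigma)
        if (samples[i] - mean(baseline)) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

Given a series of hourly megabyte counts such as `[100, 102, 98, 101, 99, 100, 103, 97, 950, 101]`, only the jump to 950 is flagged; the ordinary wobble around 100 is not. An alert raised on that first spike gives someone the chance to act immediately instead of waiting for a second occurrence.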