Since January 2, 2018, when the Intel CPU chip-level vulnerabilities Meltdown and Spectre were made public, media coverage has been nearly unprecedented. Yet even after nearly a month of efforts across the entire industry chain to fix the vulnerabilities, there still seems to be no final solution. I think this may be one of the industry’s most serious challenges since the birth of the computer, alongside the millennium bug. Some experts have even predicted that Moore’s Law may be nearing its end because of the impact of this incident.
If you react to the CPU M/S vulnerabilities by just patching, you may open a terrible Pandora’s box! In this blog, I’ll tell you why I’m not being an alarmist, the most important things you need to think about, and what the correct next steps are.
The Impact of CPU M/S Vulnerabilities
1. The relevance of the vulnerability: devices and systems you know may not be immune
According to evaluations and reports from well-known international research institutions, equipment manufacturers, and the media, almost all processors manufactured in the past 20 years are affected by these two vulnerabilities, including but not limited to those from Intel, AMD, Qualcomm and ARM. Even NVIDIA GPU product lines (including GeForce, Quadro, NVS, Tesla, and GRID) are reportedly affected, mostly by the Spectre vulnerabilities.
Devices and systems running on these CPUs include mobile phones, mobile terminals, PCs, servers, cloud virtual machines, and other specialized devices. It doesn’t matter which operating systems are running on them, because the underlying CPUs contain the vulnerabilities.
2. The severity/risk of the vulnerability: critical sensitive information will be leaked
By exploiting the CPU M/S vulnerabilities, hackers may:
- Access low-level operating system information and steal secret keys;
- Grab information that will allow them to bypass kernel or hypervisor isolation protection;
- Gain access to multiple tenants running on shared cloud services and steal private information;
- Capture sensitive user information through the browser, such as account numbers and passwords, bank card numbers and passwords, email addresses, and cookies.
The Right Steps to React to CPU M/S Vulnerabilities
1. Enhance the security of key information assets
As of today (Feb 16, 2018), there is still no good patch available, so it is imperative to upgrade the security of key information assets. We should re-examine our information security technology and management systems, and further optimize how protection technologies are designed and deployed. Introducing two-factor or multi-factor verification for access to and processing of key information may be one of the most cost-effective and rapid methods.
2. Pay close attention to the progress
Given the current circumstances, we need to keep a close watch on announcements and evaluations from major CPU vendors, operating system vendors, equipment manufacturers and related technical communities, and be ready to act when they release new remediation information or improved, stable software patches.
3. New patches: test, test, test again
I’m going to say this three times: “Test, test, and test again.” This is especially relevant for these vulnerabilities, as the patches released to date have resulted in system failures or performance hits. Just because a vendor has released a patch doesn’t mean it will work in your environment.
4. Do a good job of computing resource capacity planning
Numerous experts have analyzed performance after patching and determined that the patches significantly affect system performance (slowing it by 10% to 30%), and future patches are not likely to be better. Many companies’ IT systems may crash even with a 10% decline in performance.
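To see why even a 10% decline can matter, here is a rough back-of-the-envelope sketch in Python. It rests on a simplifying assumption that is mine, not taken from any vendor data: a slowdown of `s` means the same workload needs 1 / (1 - s) times as much CPU. The function name and the sample utilization figure are illustrative only.

```python
def utilization_after_patch(current_utilization: float, slowdown: float) -> float:
    """Estimate CPU utilization after a patch that reduces throughput by `slowdown`.

    Assumption (illustrative): the same workload needs 1 / (1 - slowdown)
    times the CPU time it needed before the patch.
    """
    if not 0 <= slowdown < 1:
        raise ValueError("slowdown must be in the range [0, 1)")
    return current_utilization / (1 - slowdown)


if __name__ == "__main__":
    current = 0.85  # hypothetical peak utilization before patching
    for slowdown in (0.10, 0.20, 0.30):
        after = utilization_after_patch(current, slowdown)
        status = "over capacity" if after > 1.0 else "within capacity"
        print(f"slowdown {slowdown:.0%}: {current:.0%} -> {after:.0%} ({status})")
```

Under this assumption, a system that peaks at 85% utilization is pushed past 100% by a 20% slowdown, and a system that peaks at 92% is pushed past 100% by a 10% slowdown.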
A rational approach would include looking at past capacity experience to anticipate where slow-downs will be a concern. However, given the lack of history with these vulnerabilities and patches, that experience might not be relevant.
Feedback from AWS and from users in the AWS community shows that the performance degradation caused by patching varies greatly from system to system.
Let me illustrate with this non-technical example. Suppose that you usually need to drive 40 minutes to go home from work each day. Now suppose that you encounter a bad traffic jam or a traffic accident. You have seen similar traffic jams in the past and have some idea that it will delay you, but you can’t be sure by how much time. Is it just excess traffic from rush hour, or is there a serious accident up ahead that will shut down the freeway completely?
Because capacity is complex, past performance impacts may no longer be a valid guide. You will need to adapt quickly to new variables and provision excess capacity to account for the unknowns.
It is time to engage capacity analysis and forecasting systems!
a. Collect historical performance data of the system, including business performance KPIs (such as number of transactions per time unit, number of concurrent users, total time for a single transaction, etc.), infrastructure resource performance KPIs (such as CPU utilization, memory utilization, IO read and write speed, etc.), and key component performance KPIs (typical transaction processing time for databases, middleware, etc.).
b. Model capacity using historical performance data. Utilize tools that leverage machine learning, other artificial intelligence algorithms, and data training, to establish a historical capacity model.
c. Perform a performance stress test of the new patch in the system test environment, collect performance data, and re-model using capacity analysis tools.
d. Compare the two models, before and after patching. Take the business performance KPI trends forecasted by the historical data model as the capacity requirement, and use the test data model to forecast the resources needed to meet it after patching. Use the resource requirements forecasted by the historical data model as a floor, or use other methods to consolidate the two capacity forecasts.
e. If the new test data is not enough for data training and modeling, analyze the percentage change (deterioration) in performance under typical load and conduct a what-if analysis on the historical data model. That is, examine how capacity requirements change when individual resource performance parameters degrade by different proportions.
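Steps a through e can be sketched roughly as follows. This is a minimal illustration in Python, not a description of any particular capacity tool. It assumes CPU utilization grows roughly linearly with transaction rate, and it uses ordinary least squares in place of the machine learning models mentioned in step b. All names (`LinearModel`, `fit`, `what_if`, `required_capacity`) and all sample figures are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class LinearModel:
    """CPU utilization (%) modeled as intercept + slope * transactions/sec."""
    slope: float
    intercept: float

    def predict(self, tps: float) -> float:
        return self.intercept + self.slope * tps


def fit(samples: list[tuple[float, float]]) -> LinearModel:
    """Steps b and c: least-squares fit over (tps, cpu_percent) samples."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return LinearModel(slope, mean_y - slope * mean_x)


def required_capacity(historical: LinearModel, patched: LinearModel,
                      forecast_tps: float) -> float:
    """Step d: the historical forecast acts as a floor for the patched one."""
    return max(historical.predict(forecast_tps), patched.predict(forecast_tps))


def what_if(model: LinearModel, degradation: float) -> LinearModel:
    """Step e: assume per-transaction CPU cost rises by `degradation`."""
    return LinearModel(model.slope * (1 + degradation), model.intercept)


if __name__ == "__main__":
    # Step a: made-up (transactions/sec, CPU %) samples for illustration.
    history = [(100, 22.0), (200, 42.0), (300, 62.0), (400, 82.0)]
    patched_test = [(100, 26.0), (200, 50.0), (300, 74.0)]

    historical_model = fit(history)
    patched_model = fit(patched_test)

    forecast_tps = 350  # hypothetical business forecast
    need = required_capacity(historical_model, patched_model, forecast_tps)
    print(f"CPU needed at {forecast_tps} tps after patching: {need:.1f}%")

    # What-if analysis when there is too little test data to refit a model.
    for degradation in (0.10, 0.20, 0.30):
        scenario = what_if(historical_model, degradation)
        print(f"what-if +{degradation:.0%}: {scenario.predict(forecast_tps):.1f}%")
```

A real system would model several resources (memory, IO, and so on) and would rarely be this linear, so treat the output as a starting point for discussion rather than a sizing decision.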
5. Prepare to provision new computing resources
Based on the patching priority for this bug and on the capacity analysis and forecast results, prepare the corresponding resources and make the environment ready for the patch change and deployment.
6. Keep communicating with business units and customers
Inform the business units as soon as possible that the CPU M/S vulnerabilities may have a negative business impact, and develop reaction plans together. Business units need to maintain good communication with customers. When preparing the patch deployment, evaluate the change plan, impact analysis and rollback plan jointly with the business units in advance, and carry out the change only once they are prepared and have agreed to it.
7. Production environment patch deployment and resource expansion
This will be a large-scale deployment. If you use a patch automation deployment tool to configure the deployment strategy, you do not have to worry about deployment speed or the risk of manual operations. In today’s automated world, many enterprises have also automated IT resource expansion.
8. Continue to monitor the operation
After patching, promptly configure the monitoring system to increase the monitoring frequency and closely watch the running status of the patched systems. Continuously collect feedback through various channels from other enterprises and communities using the new patches, and take timely measures to deal with any problems.
In summary, the Meltdown/Spectre vulnerabilities are not to be taken lightly. The vendors’ attempts have not yet produced solid patches, and at the same time hackers are actively working on exploits. You should have good capacity planning in place for the potentially major performance impact of patches (TrueSight Capacity Optimization can help with this). For those patches that have been released, it’s important to test not only whether the patches are stable, but also how any performance degradation impacts your applications or services. One thing that you can do now is better understand where those vulnerabilities exist in your environment (SecOps Response Service can help with this) and plan for a fast and smooth remediation process once you have tested the patches (BladeLogic Automation can help with this).
Li Peng contributed to this article.