Joseph Mathenge – BMC Software | Blogs

IT Governance: An Introduction

Nearly all organizations are significantly dependent on technology. Even the smallest of enterprises will probably require a computer or mobile phone for communication, tracking of transactions, research, or accessing government services.

For most corporate entities, their strategies are heavily linked to exploiting emerging technologies through digital transformation. According to IBM's research, executives rank technology as the top external force expected to impact their businesses in the near term in 2022, ahead of regulatory concerns and market factors. The same research identifies the top technologies they expect to deliver business results.

This importance of, and dependence on, technology means that organizations need to carefully weigh their investment in it, as well as the risks that arise from its use, including underutilization and misuse. Decisions regarding IT spend are no longer relegated to IT practitioners, but nowadays involve the highest levels of leadership.

That is where governance comes in—especially for entities which are heavily dependent on technology to achieve business objectives and are wary of the negative effects that could result from IT failures or misuse, such as loss of business and customers, negative reputation, and/or regulatory penalties.

Let’s take a deep dive into what IT governance is and how organizations can leverage governance to make a return on their investments in technology as well as limit its harmful impacts.

What is IT Governance?

The ISO/IEC 38500:2015 standard for the governance of IT for the organization defines IT governance as the system by which the current and future use of IT is directed and controlled.

Governance enables effective and prudent management of IT resources in a way that facilitates long-term business success. IT governance is usually a subset of overall corporate governance, and as a result there is usually significant alignment between the two. The work of IT governance can be grouped into three activities according to COBIT:

Governance

  • Evaluating stakeholder needs, conditions, and options to determine balanced, agreed-on enterprise objectives. This includes review of past business performance, future imperatives, and the current and future operating model and environment. Assessments such as SWOT analysis, PESTEL analysis, and risk assessments are important inputs to this evaluation.
  • Directing the organization through prioritization and decision making. This is usually in the form of strategies and policies, as well as establishment of controls.
  • Monitoring performance and compliance against agreed-on direction, regulations and objectives. This is usually carried out through compliance audits and performance reports.

In most organizations, corporate governance is the responsibility of the board of directors, but specific governance responsibilities may be delegated to specific structures at an appropriate level, especially for large complex entities. An IT governance body might be a subset of the board with some depth of IT knowledge, or a group of senior executives (drawn from both business and IT) directly overseeing funding, management, and usage of IT.

The ITIL 4 Direct Plan and Improve guidance provides examples of key governance roles and their responsibilities:

Governance structures and their roles in governance:

Board of directors: Responsible for their organization's governance. Specific responsibilities include:

  • Setting strategic objectives
  • Providing the leadership to implement strategy
  • Supervising management
  • Reporting to shareholders

Shareholders: Responsible for appointing directors and auditors to ensure effective governance.

Audit committee: Responsible for supporting the board of directors by providing an independent assessment of management performance and conformance.

Good vs bad governance

Governance is a function of human behavior. So, when it comes to good vs bad governance, the outcome is tied to two things:

  • Whether the governance body does its job responsibly and effectively.
  • Whether the stakeholders (i.e., management, employees, contractors or partners) are committed to upholding the governance framework.

Where the governance body is not knowledgeable or fully committed, there is a possibility that management ends up steering IT in a direction that later harms the organization. A case in point is the abuse of users' personal information, or the introduction of bias into machine learning, by some organizations; these failures have resulted in severe regulatory penalties and reputational damage, translating into financial loss.

Bad IT governance can be characterized by the following signs:

  • The IT function makes all the decisions on the direction of technology without oversight or input from the rest of the business.
  • IT budget spend frequently spirals out of control, with unending or stalled projects that do not provide the expected benefits to the organization.
  • The governance body is reactive in nature, only called into action when things go wrong, such as major IT system failures, negative audit findings, or regulatory issues.
  • IT objectives are not aligned with the organization’s strategic objectives.

Good IT governance takes a holistic approach, ensuring that all stakeholders are involved and committed to putting in place all the necessary elements required to build and sustain an effective governance framework. COBIT gives a list of such components including: processes, organizational structures, policies and procedures, information flows, culture and behaviors, skills, and infrastructure.

Best practices in governance

The ISO/IEC 38500:2015 standard defines six principles that are necessary for effective governance of IT in the organization:

  1. Responsibility. Everyone within the organization understands and accepts their responsibilities, both in terms of demand for and supply of IT, and has the authority to meet them.
  2. Strategy. Business strategy takes into account current and future IT capabilities, and plans for the use of IT support the current and ongoing business strategy.
  3. Acquisition. All IT investments are made for valid reasons, on the basis of relevant analysis and transparent decision making, with an appropriate balance between benefits, costs, and risks to the organization.
  4. Performance. IT is fit for purpose, providing services that meet the business requirements in terms of quality and service levels.
  5. Conformance. The use of IT systems complies with all applicable legislation and regulations, as well as organizational policies and practices, which should be well defined, implemented, and enforced.
  6. Human behavior. Respect for human behavior is demonstrated in IT policies, practices, and decisions, even as needs evolve among all stakeholders.

Additional principles as defined by COBIT are that the IT governance system should:

  • Satisfy stakeholder needs and generate value from the use of information and technology.
  • Be built from a number of components that can be of different types and that work together in a holistic way.
  • Be dynamic, always considering the effect of changes to any of its design factors.
  • Clearly distinguish between governance and management activities and structures.
  • Be tailored to the enterprise’s needs, using a set of design factors as parameters to customize and prioritize its components.
  • Cover the enterprise end to end, focusing on all technology and information processing the enterprise puts in place to achieve its goals, including outsourced processing.

Related reading

What Are APTs? Advanced Persistent Threats Explained

Today, the global economy is heavily centered on digital technology—and the data held by individuals and entities now commands a high premium.

As a result, cybercrime has become more and more sophisticated, especially where organized groups invest in skills, tools, and processes to take down targets and monetize the looted information. Be it government agencies, research institutions, or corporations, wherever valuable data can be found, these groups take their time to:

  • Investigate, infiltrate, and extract data
  • Extort a ransom
  • Damage IT systems

This type of long-term attack by specialist groups is called an advanced persistent threat (APT).

A report by ENISA, the EU Agency for Cybersecurity, showed that attacks conducted by APTs on EU institutions, bodies, and agencies increased by 30% in 2021. Just recently, the Red Cross detailed such an attack in which personal data belonging to over 500,000 people was compromised. The attack was discovered on 18th January 2022, but the intrusion was determined to have occurred on 9th November of the previous year.

In this article, let's take a deep dive into APTs: who is behind them, how they structure their attacks, and, more importantly, how to protect ourselves from such entities.

What is an APT?

An APT is a calculated network attack on any organization. These threats occur when a hacker, or group of hackers, establishes a foothold inside an enterprise network. APTs go undetected for prolonged periods of time, allowing for sensitive data to be mined.

The term APT references the type of attack—multi-stage in nature—but over time has been used to characterize the groups or the tools in use. The primary goal of APTs is data theft, but there is increasing evidence of other objectives such as:

  • Ransomware
  • Espionage
  • Systems disruption
  • Crypto mining

So, who is conducting APTs? The characteristics of such attacks indicate that the main players are well-funded entities who have the time, muscle, and laser-focused attention to get to their goal.

There is significant evidence that some of these groups are state-sponsored entities, like APT27 and Winnti, which are alleged to be Chinese-sponsored, with the former recently flagged by the German government for attacks on government agencies. The US CISA has also raised an alert about Iranian-sponsored APTs exploiting Fortinet and Microsoft Exchange vulnerabilities.

Trend Micro's 2021 mid-year cybersecurity report listed the following groups (with interesting coined names) actively involved in APT attacks:

  • Team TNT targeted AWS credentials and Kubernetes clusters for crypto mining.
  • Water Pamola targeted e-commerce shops in Japan with XSS script attacks.
  • Earth Wendigo targeted Taiwan institutions webmail with malicious JavaScript backdoors.
  • Earth Vetala targeted institutions in the Middle East using remote access tools to distribute malicious utilities.
  • Iron Tiger targeted institutions in Southeast Asia using a SysUpdate malware variant.

Lifecycle & characteristics of an APT

While no two APTs are the same, in general, advanced persistent threats operate in a systematic manner. The lifecycle of an APT happens in five stages, as listed below:

[Figure: Lifecycle and characteristics of an APT]

Stage 1: Targeting/Reconnaissance

Initially, an enterprise is targeted by hackers who seek to accomplish a singular agenda. Infiltration occurs through identified weaknesses in the network, web assets, or other resources that hackers can gain access to.

Attackers will also use information from the internet and social media to identify contacts of potential victims to be targeted through social engineering attacks such as spear phishing.

Stage 2: Entry

Hackers gain access using SQL injections, remote file inclusions (RFIs), or phishing scams that enable entry via user access points. Exploiting zero-day vulnerabilities in unpatched systems is fast becoming the go-to entry method for most APTs:

  • The Red Cross attack involved exploiting an unpatched critical vulnerability in Zoho ManageEngine ADSelfService Plus (CVE-2021-40539).
  • The APT attack on a U.S. municipal government webserver involved exploitation of vulnerabilities on a Fortinet FortiGate appliance, and the creation of an account with the username “elie” to enable further malicious activity.

Once inside a network, hackers will often create a backdoor by uploading malware that allows repeatable entry. In Germany, APT27 used the HyperBro remote access trojan to backdoor the networks of compromised commercial companies. Additional attacks may be used to create a smoke screen that buys hackers time to extend their access undetected.

(Understand how vulnerabilities work.)

Stage 3: Discovery

Entry into the system is the first milestone for a hacker launching a calculated APT attack. The next involves taking steps to avoid detection. To do this, hackers will map out the organization's infrastructure and launch additional attacks against the system, geared at gaining access to user accounts higher in the hierarchy. The higher in the hierarchy a malicious attacker can get, the better the access to sensitive information.

Post-exploitation activities identified in the Red Cross attack included compromising administrator credentials, conducting lateral movement, and exfiltrating registry hives and Active Directory files.

Stage 4: Capture

An infrastructure left vulnerable from multiple cyber-attacks is easier to move around in undetected. Under these conditions, hackers begin capturing data over an extended period of time. Capture can also include:

  • Building stable remote control
  • Establishing communication with command-and-control centers

The hackers involved in the Red Cross attack deployed offensive security tools which allowed them to disguise themselves as legitimate users or administrators.

Stage 5: Data exfiltration

Once the desired data is identified, infiltrators can deploy malware extraction tools to steal it. Usually this means creating “white noise” attacks to cover the tracks of attackers who want to mask their intentions. They also mask their entry point, leaving it open for further attacks.

An alternative is ransomware, where the APT will encrypt the victim's enterprise data and demand payment in cryptocurrency in exchange for decryption keys.

Identifying APTs: What to look for

If an enterprise business has been hit with an APT, it can be hours, days, or even longer before the problem is discovered. But time is of the essence when it comes to protecting your organization.

Monitoring your infrastructure for these signs can help you stay ahead of hackers who try to establish a foothold in your network:

Increase in late-night logins

Are employees suddenly logging in late at night? This could be a warning sign that cyber attackers have gained access to your employees' logins and are using them at night, when no one is around to stop them.

What to do: If enterprise business leaders see this kind of activity, it should be a red flag to further investigate for vulnerabilities.
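
As a minimal illustration of this kind of check, the sketch below flags logins that fall inside a quiet-hours window. It assumes authentication events are available as (user, timestamp) records; the usernames, timestamps, and window are illustrative assumptions, not tied to any particular logging product.

```python
from datetime import datetime

# Illustrative auth events: (username, ISO 8601 timestamp). In practice
# these would come from your SIEM or authentication logs.
events = [
    ("alice", "2022-02-20T02:13:00"),
    ("bob",   "2022-02-20T09:45:00"),
    ("alice", "2022-02-21T03:02:00"),
]

QUIET_HOURS = range(0, 5)  # 00:00-04:59 local time; tune to your business

def off_hours_logins(events, quiet_hours=QUIET_HOURS):
    """Return login events that occurred during quiet hours."""
    return [(user, ts) for user, ts in events
            if datetime.fromisoformat(ts).hour in quiet_hours]

for user, ts in off_hours_logins(events):
    print(f"Review: {user} logged in at {ts}")
```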

Trojans are prolific in the network

When hackers access a computer in a network, they often install a trojan which gives them total control over that machine, even after passwords have been updated for security.

What to do: If enterprise organizations have a network full of trojans, they should consider the possibility that the network is under attack from an APT.

Unexpected data bundles

One way cyber attackers move data is by putting large amounts of data into bundles before shipping it out of the system.

What to do: Identifying unexpected bundles of gigabytes of data is a good indicator that you should check your enterprise infrastructure.

Unexpected data flows

One way to spot an APT is to look for unexpected flows of data. These could be computer to computer, server to server, in or out of network. In order to identify whether an information flow is unauthorized or unexpected, you have to know what’s reasonably expected within your current infrastructure.

What to do: Define reasonable expectations for data flows and monitor for discrepancies.
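
As a hedged sketch of this idea, the snippet below keeps a baseline of expected host-to-host flows and flags anything outside it, plus unusually large transfers on known paths. The hostnames, byte counts, and threshold are purely illustrative assumptions.

```python
# Baseline of expected (source, destination) pairs; values are illustrative.
baseline = {("app01", "db01"), ("app01", "cache01")}
MAX_BYTES = 5 * 2**30  # flag single flows above ~5 GiB; tune per environment

# Observed flow records: (source, destination, bytes transferred)
observed = [
    ("app01", "db01", 120_000_000),
    ("db01", "203.0.113.7", 8 * 2**30),  # unexpected destination and size
]

for src, dst, nbytes in observed:
    if (src, dst) not in baseline:
        print(f"Unexpected flow: {src} -> {dst} ({nbytes} bytes)")
    elif nbytes > MAX_BYTES:
        print(f"Unusually large transfer on known path: {src} -> {dst}")
```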

Final thoughts

Advanced persistent threats are complicated, calculated, long-game attacks that can have devastating effects on an enterprise business and, unfortunately, cannot be easily predicted. However, enterprise organizations don’t have to be at the mercy of APTs. You can implement strategies that include:

  • Continuous automated patching
  • Advanced endpoint detection and response monitoring systems
  • Multi-factor authentication and strong password protection mechanisms
  • Response planning to create a big picture of what to do if a breach occurs

Deploying AI- and ML-based security solutions can be highly effective in detecting anomalous behavior, which is one of the hallmarks of an APT attack.

Related reading

Risk Management: A Complete Introduction To Managing Enterprise Risk

Global pandemics on the scale of COVID-19 or the Spanish Flu have an annual occurrence probability that varies between 0.27% and 1.9%. And while organizations with robust enterprise risk functions had identified pandemics as one of their risks, the low probability meant that few had put in place measures to mitigate the potential occurrence.

Safe to say, recent events have schooled us all.

From cyberattacks to air crashes, third-party compromise to regulatory changes, employee unrest to economic downturn, the business environment is rife with uncertainties. Having an approach to anticipate such events and limit their impact should they materialize is critical for any enterprise that wants to remain in business.

As an organization defines its strategic goals and objectives, a realistic look at threats to success can go a long way in enabling the enterprise to remain on track. Investing in a risk management approach is the mark of mature companies who are well aware that the path to their vision is not always straightforward.

Let's look at some of the key aspects that define risk management.

What is risk?

The ISO 31000 standard for risk management guidelines defines a risk as:

The effect of uncertainty on objectives.

The outcome of the uncertainty can swing in either a positive or negative direction. If the risk is negative, the uncertain outcome results in harm or loss, for instance lost customers, regulatory penalties, or reduced business revenue. On the other hand, if the risk is positive, the uncertain outcome can result in benefits if exploited, e.g., regulatory changes can be favorable in terms of new business opportunities.

Elements of risk

To fully express a risk, one has to consider the following elements:

  • Risk source. An element which, alone or in combination, has the potential to give rise to risk. Examples here include weather conditions, government agencies, disgruntled employees, etc.
  • Risk event. The potential occurrence or change of a particular set of circumstances. For example: a cyberattack, flooding of a data center, mass resignation, adverse regulation, etc.
  • Risk consequence. The outcome of an event affecting objectives. For instance: lost revenue, penalties from a regulator, disrupted operations, corrupted data, etc.
  • Risk likelihood. The chance of something happening—for instance, low or high probability which can be objectively or subjectively computed.

Responding to risk

In order to effectively respond to risks, an approach is required. That’s where risk management comes into play.

Defining risk management

ISO 31000 defines risk management as

Coordinated activities to direct and control an organization with regard to risk.

ITIL® 4 states that the purpose of the risk management practice is to ensure that the organization understands and effectively handles risk, guaranteeing ongoing sustainability and value co-creation.

Principles for effective risk management, as outlined in ISO 31000, include ensuring that your risk management practice:

  1. Creates and protects value.
  2. Is made an integral part of all organizational processes.
  3. Is made part of decision making.
  4. Explicitly addresses uncertainty.
  5. Is systematic, structured, and timely.
  6. Is based on the best available information.
  7. Is tailored.
  8. Takes human and cultural factors into account.
  9. Is transparent and inclusive.
  10. Is dynamic, iterative, and responsive to change.
  11. Facilitates continual improvement of the organization.

(Learn more about risk management in ITIL 4 & ITIL v3 environments.)

Risk management steps

Let's look at a couple of well-known frameworks.

Management of Risk framework

At a high level, the risk management process can be broken down into five iterative steps as outlined by Axelos’ Management of Risk (M_o_R) framework:

M_o_R Risk Management Process

1. Identify

The organization identifies its strategic and operational context, and then identifies the risks based on that context. The context leads to a determination of the organization's capacity and tolerance for risks should they materialize. Risks identified are documented in a risk log or register.

2. Assess

The risks identified are then assessed to determine their likelihood and consequence. The assessment then feeds an evaluation that ranks the risks by priority, where risks with higher consequence and likelihood are prioritized higher. A risk heat map is a tool that can be used to visualize this prioritization.
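
As a minimal sketch of this step, the snippet below scores each risk as likelihood multiplied by consequence on a 1-5 scale and ranks the register, the same ordering a heat map would visualize. The risks and scores are illustrative assumptions only.

```python
# Minimal risk prioritization sketch: score = likelihood x consequence,
# both on an illustrative 1-5 scale.
risk_register = [
    {"risk": "Data center flooding",  "likelihood": 2, "consequence": 5},
    {"risk": "Cyberattack",           "likelihood": 4, "consequence": 5},
    {"risk": "Key staff resignation", "likelihood": 3, "consequence": 3},
]

for r in risk_register:
    r["score"] = r["likelihood"] * r["consequence"]

# Highest-priority risks first, as they would appear on a risk heat map
for r in sorted(risk_register, key=lambda r: r["score"], reverse=True):
    print(f'{r["risk"]}: score {r["score"]}')
```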

3. Plan

Planning involves identifying and evaluating the appropriate risk response to remove or reduce threats, and to maximize opportunities. Responses can be categorized as follows:

  • Avoid: Eliminating the uncertainty by not proceeding with the course of action in which the risk would materialize. For example, not hosting your data in the cloud due to the risk of transferring personal data outside the local jurisdiction.
  • Reduce: Identify actions that reduce the probability and/or consequence should the risk materialize by putting in place mitigation controls. For example, putting in place policies that prevent senior officials from travelling on the same flight or in the same vehicle.
  • Transfer: Identify a third party who is willing to take up the risk on behalf of the organization. This option is usually tagged to insurance covers.
  • Share: Identify a third party who is willing to take up part of the risk with the organization. This option is usually applied to customers, partners, or suppliers.
  • Accept: Live with the uncertainty and take no action to forestall it.

4. Implement

Here the planned risk responses are actioned, their effectiveness monitored, and corrective action taken where responses do not match expectations.

5. Communicate

This is a standalone step that occurs concurrently with the previous four. Risk information and treatment status are reported to key stakeholders through agreed channels. This step is also very relevant whenever an identified risk materializes.

NIST risk management framework

The NIST risk management framework (RMF) provides a comprehensive, flexible, risk-based process that integrates security, privacy, and cyber supply chain risk management activities into the system development life cycle through the seven steps outlined below:

NIST RMF Steps

  1. Prepare. Carry out essential activities to help prepare all levels of the organization to manage its security and privacy risks.
  2. Categorize. Determine the adverse impact with respect to the loss of confidentiality, integrity, and availability of systems and the information processed, stored, and transmitted by those systems.
  3. Select. Select, tailor, and document the controls necessary to protect the system and organization commensurate with risk.
  4. Implement. Implement the controls in the security and privacy plans for the system and organization.
  5. Assess. Determine if the controls are implemented correctly, operating as intended, and producing the desired outcome with respect to meeting the security and privacy requirements for the system and the organization.
  6. Authorize. Provide accountability by requiring a senior official to determine whether the security and privacy risk, based on the operation of a system or the use of common controls, is acceptable.
  7. Monitor. Maintain ongoing situational awareness about the security and privacy posture of the system and organization to support risk management decisions.

Risk Management Roles

Now that we understand the purpose of and steps in any risk management practice, let's look at the people involved. Key roles required for effective risk management in an organization include:

  • Risk Committee. This is a subset of the organization’s board whose mandate is the oversight and approval of the enterprise risk management framework. This includes defining risk tolerance and appetite, providing resources for risk mitigation, setting governance policies, and evaluating performance of the implemented risk mitigation.
  • Risk Manager. This role is responsible for coordinating the implementation of the enterprise risk management framework including guiding the rest of the organization in identifying, assessing, mitigating, and monitoring risks. The role will provide reports on the status of the risk management framework and can be elevated to Chief Risk Officer or Head of Risk depending on the size of the organization.
  • Risk Officer. This role reports to the risk manager and carries out the basic risk management activities and maintains documentation on the same.
  • Risk Owner. This role is responsible for the management, monitoring, and control of all aspects of a particular risk assigned to them, including the implementation of the selected responses to address the threats or to maximize the opportunities.
  • Risk Actionee. This role is responsible for implementation of selected risk responses. It can be carried out by the Risk Owner or be outsourced to a third party.

Success factors in risk management

Success in risk management is a chance in itself—that's because you can never plan perfectly (unless you can see the future). However, having a robust yet flexible framework can be the difference between successfully navigating a challenging risk and seeing your enterprise go under.

Key elements required in successful risk management according to the ITIL 4 practice guide include:

  • Establishing governance of risk management
  • Nurturing a risk management culture and identifying risks
  • Analyzing and evaluating risks
  • Treating, monitoring, and reviewing risks

Related reading

The IT Vendor Management Office (VMO) Explained

In today’s digital age, one distinct element that determines an organization’s competitive edge is the quality of services provided by vendors within its value chain activities. Consider estimates from Gartner that companies will spend $474 billion on cloud services in 2022—just one example of how vendors are becoming critical in digital service delivery.

The management of the supplier ecosystem is a critical success factor for any enterprise, as customer experience and satisfaction are largely determined by vendor-supported touchpoints and interactions. And with enterprises sourcing a variety of services from multiple vendors, having an approach to coordinate them effectively is fast becoming an invaluable capability.

Enter the vendor management office (VMO), a cross-functional business unit responsible for implementing an enterprise's vendor strategy with a view to providing visibility across the value chain. The VMO provides value to the organization by lowering operational costs, eliminating duplication of resources, and strengthening relationships with these key partners in the service delivery journey.

Let’s consider how the VMO does this.

What is Vendor Management?

A vendor—or supplier, or seller—is an organization that contributes goods or services to another organization within a supply chain. According to Gartner, vendor management helps organizations “control costs, drive service excellence and mitigate risks”, all in pursuit of increasing value and return on investment.

Effective vendor management ensures that the organization’s vendors and their performances are managed appropriately to support the seamless provision of quality products and services. According to ITIL® 4, vendor management activities involve:

  • Establishing a common approach to sourcing strategy and management of vendor relationships
  • Maintaining a single point of control over active and planned vendor contracts and services

Depending on the organization, vendor management can be positioned as a strategic, tactical, or operational capability:

At the operational level, vendor management activities are simply an extension of the enterprise’s procurement function. On the other hand, the tactical level has a more specific focus with individual business units responsible for management of their own vendors.

At the strategic level, vendor management oversees value creation and preservation through a standardized sourcing delivery model across the entire organization, providing end-to-end visibility and ensuring alignment to strategy, managing risks and costs effectively, while building relationships that are mutually beneficial to both parties.

Role of a Vendor Management Office

We can infer from the SIAM body of knowledge that in IT environments where many services are commoditized and multiple vendors need to work together, organizations require the right vendor management capability in order to:

  • Understand the end-to-end picture of service provision
  • Coordinate the activities of multiple service providers
  • Provide a single source of truth regarding service performance
  • Be a trusted partner in developing new services and strategies
  • Optimize delivery through people, processes, tools, and vendors
  • Ensure smooth performance of day-to-day operations, enabling the organization to concentrate on more progressive activities

In the paper How to Build a VMO, ISG states that the vendor management office (VMO) is a critical governance and oversight tool designed to facilitate collaboration within the sourcing model, as well as alignment between the sourcing model and the business. The role of the VMO is primarily to oversee five disciplines of service delivery within the sourcing lifecycle:

Contract

The VMO ensures contract rationalization and alignment across the enterprise. Contract information is made visible to key stakeholders and managed effectively, especially towards the end of the vendor contract, ensuring timely renewal or transition where required.

Finance

The VMO manages the financial aspects of the vendor engagement, from negotiations during sourcing and onboarding/offboarding cost management, to tracking spending throughout the life of the contract and working with the business to determine the value of vendor engagements, including return on investment.

Relationship

The VMO will establish and nurture links between the organization and vendors at all levels. Relationship management will include identification of shared or mutual goals, promotion of no-blame cooperative and collaborative culture, supporting continuous learning, maintenance of open and transparent communication, as well as handling conflicts through mediation and other mechanisms.

Performance

The VMO will track the performance of vendors to ensure organizational goals and objectives are met. This includes ensuring alignment between the organization's service level targets in client SLAs and the targets defined for vendors in their contracts.
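
As a minimal illustration of that alignment check, the sketch below compares hypothetical vendor contract targets against a client SLA; the vendor names, metrics, and thresholds are assumptions for the example only.

```python
# Client-facing SLA targets the organization has committed to.
client_sla = {"availability_pct": 99.9, "response_minutes": 30}

# Targets defined in each vendor's contract (illustrative values).
vendor_contracts = {
    "VendorA": {"availability_pct": 99.95, "response_minutes": 15},
    "VendorB": {"availability_pct": 99.5,  "response_minutes": 60},
}

# A vendor target looser than the client SLA is a gap the VMO should flag.
for vendor, targets in vendor_contracts.items():
    if targets["availability_pct"] < client_sla["availability_pct"]:
        print(f"{vendor}: availability target below client SLA")
    if targets["response_minutes"] > client_sla["response_minutes"]:
        print(f"{vendor}: response target slower than client SLA")
```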

Compliance

The VMO will liaise with the legal department to ensure that the vendor contracts and agreements clearly spell out compliance requirements that the vendors should adhere to such as:

  • Corporate policies
  • Data privacy
  • Other appropriate legal and regulatory requirements

Benefits of a Vendor Management Office

The benefits of a successful VMO are numerous. WGroup, a management consulting firm with deep experience in IT optimization, rightfully notes that revenue enhancement is the #1 objective of a VMO. But that is just the start.

Chuck Crafton of the Project Management Experts group has identified these critical benefits—among many others—that should result from implementing a VMO:

  • Improved supplier relationships (coordination, collaboration, and communication)
  • Centralized procurement and contract management
  • Improved governance, resulting in consistency and compliance
  • Dispute and issue resolution management
  • Vendor risk identification, assessment, and mitigation management

Tools for VMO success

An effective VMO should be equipped with appropriate communication, collaboration, and monitoring tools to support the vendor lifecycle.

For most organizations, the ERP is the primary source of vendor information as relates to procurement of services and products, and asset management. However, other organizations rely heavily on supplier and contract management information systems that not only manage documentation but also provide vendor performance monitoring functionality.

Best practices for Vendor Management Offices

CIOs and other professionals responsible for ensuring the VMO is producing value should consider the following activities to guide their approach:

  • Provide guidance during RFP creation. The VMO should provide templates and best practices across business units for RFP processes.
  • Develop a more structured approach to negotiations. The VMO should formalize the negotiation process and leverage performance metrics for renegotiations.
  • Help put better contracts in place. The VMO should be a critical stakeholder in the review and management of contracts, and work on systematically improving contracts over time.
  • Regularly evaluate relationship and performance management. Business managers should see higher quality and lower costs from suppliers as a result of VMO’s work.
  • Solicit feedback on VMO performance. A best practice for a VMO is to seek feedback from stakeholders regularly after sourcing events to track internal satisfaction with the process.

Related reading

Observability vs Event Management: What’s The Difference?

When it comes to alerts and alarms, there is only one movie scene that comes straight to mind: “Houston, we have a problem!” The explosion of an oxygen tank on the Apollo 13 lunar mission set off a colorful array of warning indicators on the spacecraft as well as at mission control, leading to hectic but heroic efforts to bring the astronauts back to Earth.

Similar scenes are replayed daily in IT departments globally as monitoring systems bring notice of failed components or degraded performance. That’s why having an approach to detect such events and alerts—and respond appropriately—is a critical capability for any service management organization.

While event management has been a mainstay of service management for years, observability has recently risen to prominence as the go-to approach for modern tech organizations. But are they really very different? This article breaks down the two approaches.

Observability basics

The term observability is not new; it was first established in control systems engineering. There, observability is defined as the ability to measure the internal states of a system by examining its outputs. When trying to understand observability, we look at it as a characteristic rather than an activity. In other words, the more observable a system is, the better placed we are to pinpoint the reason for things going wrong.

Where monitoring refers to that activity of looking at the change of state in a system to determine if it is working well or not, observability is more about the actual capability of the system and its management to effectively convey the reason for the change of state. So, we can actually look at observability from two perspectives:

  • The design of the system
  • The capability of monitoring tools

(Compare observability & monitoring.)

The design of the system

The Google Cloud Architecture guides on DevOps indicate that systems are instrumented with code or components that expose their inner state. A well-instrumented system aids observability because the system itself provides quality outputs that reveal its true internal state.

For instance, during development or deployment, code is added to the software system to keep track of connection pool information such as unused connections, failed connections, etc., which can be exposed through the channels below (see the instrumentation sketch after this list):

  • Observability & event management (OEM) tools
  • Scripts
  • A third-party monitoring solution
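
As a minimal instrumentation sketch, the snippet below exposes connection-pool gauges using the open-source prometheus_client library, one common way to publish such metrics. The metric names, port, and sampled values are illustrative assumptions rather than any specific product's API.

```python
import random
import time

from prometheus_client import Gauge, start_http_server

unused = Gauge("db_pool_unused_connections", "Unused connections in the pool")
failed = Gauge("db_pool_failed_connections", "Failed connection attempts")

def sample_pool():
    # In a real system these values would come from the connection pool's
    # own counters; random values here just make the sketch runnable.
    unused.set(random.randint(0, 10))
    failed.set(random.randint(0, 2))

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/
    while True:
        sample_pool()
        time.sleep(5)
```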

The capability of monitoring tools

Observability has been heavily marketed (some would say overhyped) as a hallmark of modern monitoring tools, particularly application performance monitoring (APM) solutions. These solutions include features that can collect, analyze, and correlate internal state data from a variety of telemetry sources such as logs, metrics, distributed traces, and user sessions.

The use of such tools, leveraging artificial intelligence (AI) and cloud-centric capabilities such as CI/CD, provides the means of keeping up with visibility requirements for the ever-evolving landscape of modern technology systems like:

  • Microservices
  • Containers
  • Serverless functions

Observability in ITSM

In the world of technology service management, observability is a critical differentiator, enabling faster detection and resolution of the incidents and problems that would otherwise plague our applications or infrastructure, resulting in poor customer experience and lost business outcomes.

According to Ubuntu, the degree of observability in a system depends on the quality of telemetry information collected and the way it is processed, which enables one to know and investigate in a timely fashion how the system is performing, what issues are occurring and what their impact is.

This is particularly important when trying to address service quality, especially when one considers benefits such as a reduction in Mean Time to Restore Service, a key customer experience indicator.

Event management overview

Event management is the practice that acts on monitored changes of state of services and their associated components, by determining their significance, and identifying and initiating the correct response to them.

(Read our event management explainer.)

According to the ITIL® 4 practice guide on this topic, information about events is also recorded, stored, and provided to relevant parties. Events materialize when a set threshold is passed (whether a warning or an exception), which triggers a pre-defined response such as:

  • Creating an alert or other notification
  • Creating an incident
  • Changing a status of a previously recorded alert or notification
  • Initiating a reactive action towards the respective component or service

From a process perspective, event handling relies on inputs from system notifications and monitoring tool outputs, which are then taken through the following activities, as guided by a monitoring plan:

  • Event detection
  • Event logging (for significant events)
  • Event filtering and correlation check (might be iterative)
  • Event classification (critical, major, medium, minor)
  • Event response selected
  • Notifications sent, response procedure carried out

These activities can be manual or automated depending on the service provider organization's capabilities, and result in appropriate responses including event analysis, incident management, and stakeholder engagement (a simplified sketch of the flow follows below). Clearly, event management is not simply the act of responding to system alerts, but rather an all-encompassing capability that requires people (roles), information and technology, processes, and, where required, partners and suppliers for success.
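
To make the flow concrete, here is a hedged sketch of a detect-filter-classify-respond pipeline. The thresholds, severity labels, and responses are illustrative assumptions, not values prescribed by ITIL.

```python
# Illustrative monitored events: source, metric, and current value.
EVENTS = [
    {"source": "disk01", "metric": "usage_pct",  "value": 96},
    {"source": "web01",  "metric": "usage_pct",  "value": 40},
    {"source": "app01",  "metric": "error_rate", "value": 0.12},
]

# (warning threshold, exception threshold) per metric, per a monitoring plan.
THRESHOLDS = {"usage_pct": (80, 95), "error_rate": (0.05, 0.10)}

RESPONSES = {
    "exception": "create incident and notify on-call",
    "warning":   "create alert for review",
}

def classify(event):
    """Map an event to a severity; a real monitoring plan is far richer."""
    warn, exc = THRESHOLDS[event["metric"]]
    if event["value"] >= exc:
        return "exception"
    if event["value"] >= warn:
        return "warning"
    return "informational"

for event in EVENTS:
    severity = classify(event)
    if severity == "informational":
        continue  # filtering: only significant events are logged and actioned
    print(f'{event["source"]} {event["metric"]}={event["value"]}: '
          f'{severity} -> {RESPONSES[severity]}')
```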

(Learn about the people, process, technology & partners paradigm.)

Drawing the line between observability & event management

The evolution from ITIL v3 to ITIL 4 saw this key practice (previously a process) renamed from “Event Management” to “Monitoring and Event Management”. The rationale was that monitoring is a trigger for event management, but not all monitoring results in the detection of an event.

So, can we say that observability relates only to monitoring? Not quite, since the value of observability spans the entire design and development lifecycle of systems.

Some of the benefits of observability, as identified by IBM, include having systems that are easier to understand, monitor, update, and repair, leading to higher quality and ultimately meeting business and customer needs. But given the activities of event management outlined above, it is clear that the value of observability can only be fully achieved when a mature and improving event management practice is in place.

Related reading

The VeriSM Management Mesh Beginner’s Guide

There is more than one way to skin a cat, and the same can be said of service management. With so many approaches, methodologies, and standards, it can get quite confusing for any organization, depending on its evolving context.

Of course, there are pros and cons to each approach, and what's good for the goose might not be best for the gander. Each body of knowledge takes a different approach—so why not pick what works for your context?

The VeriSM approach, unveiled in 2017, does exactly that, helping organizations to evolve their operating model in a flexible and responsive way through an integrated selection of management practices. Let's explore what that looks like in detail.

(New to VeriSM? Start with our VeriSM introduction.)

VeriSM introduction


The VeriSM approach was developed through a partnership with the global service management community in response to changing demands on service management and the impact of digital transformation. Aptly described as ‘an approach for the digital age’, the term VeriSM stands for:

  • Value-driven: focus is on providing value
  • Evolving: an updated approach which will continually evolve
  • Responsive: facilitating a tailored approach based on context
  • Integrated: fitting different practices together
  • Service
  • Management

VeriSM is opposed to a ‘one size fits all’ approach on the basis that organizations are different in many aspects such as size, portfolio, culture, market, customer segments, etc. This approach is unique in that it:

  • Doesn’t tie organizations to a single management product
  • Allows the operating model to change when required

The VeriSM model starts with the consumer defining their outcomes and ends with the consumer verifying that those outcomes have been achieved, leading to value. For the service provider, adopting the VeriSM model requires the following components, as shown in Figure 1 below:

  • Governance
  • Service Management Principles
  • The Management Mesh
  • The stages a product or service moves through, from requirements definition to provision to a customer: Define, Produce, Provide, and Respond

The VeriSM Model

What is the Management Mesh?

A cursory glance at the VeriSM model immediately reveals the common elements you would expect to see in any service management approach. The one differentiator is the Management Mesh, which is unique to VeriSM. So, what is it?

The Management Mesh is a concept that proposes a method to manage and use the multitude of service management frameworks, standards, methodologies, management principles and philosophies. The mesh is defined by four elements that influence or directly contribute to product and service delivery:

  • Resources. What the service provider leverages to create products and services e.g., assets, budget, people, time, knowledge, suppliers, etc.
  • Environment. The service provider’s operational context including internal and external factors such as competition, regulation, culture and service stabilizers like processes, tools and measurements.
  • Management practices. Existing bodies of knowledge on service management approaches such as agile, Lean, ITIL®, DevOps, COBIT, SIAM, ISO/IEC 20000, etc.
  • Emerging technologies. Advances in modern technology that the service provider can deploy to develop and improve its products and services such as cloud technologies, mobile applications, cognitive technologies, internet of things (IoT), etc.

The combination of these elements, like strands in a woven fabric, provides the capability that a service provider can leverage to meet customer needs effectively. The Management Mesh can be visualized with example elements as shown in Figure 2 below:

The VeriSM Management Mesh

Succeeding with the Management Mesh

The Management Mesh is one of the trickier components to get to grips with when it comes to the VeriSM model.

It is the secret sauce that makes the VeriSM model flexible and adaptable for any organization regardless of context. The specific elements of the mesh need to be defined by the organization, including the measurement scale or model used to quantify the mesh's different contributing elements.

This is carried out in alignment to the four stages of the model as follows:

  1. Define & Produce stage. The organization conducts a current state assessment of the four elements. Then the requirements of the future desired state are clearly defined with respect to the four elements. The organization then compares the two states and identifies gaps, before creating a solution to meet the consumers’ needs/requirements, plus a new/changed mesh.
  2. Provide stage. The organization provides the changed/new solution to the consumer based on the new/changed mesh.
  3. Respond stage. The organization handles requests and addresses and addresses issues using the elements in the new/changed mesh.

A pictorial representation is shown below. The green lines represent the current state, the red lines the desired future state, and the blue lines the new mesh after the gaps have been identified and a solution developed.

Progressing the Management Mesh through the VeriSM model stages (Source: VeriSM: Unwrapped and Applied)
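
A minimal sketch of that current-versus-desired comparison is shown below, using an illustrative 0-5 maturity score per component; the elements, components, and scores are assumptions for the example, not a prescribed VeriSM scale.

```python
# Current and desired states across the four mesh elements (0-5 scores).
current = {
    "resources":             {"people": 3, "budget": 2},
    "environment":           {"regulation": 4},
    "management_practices":  {"ITIL": 3, "DevOps": 1},
    "emerging_technologies": {"cloud": 2, "IoT": 0},
}
desired = {
    "resources":             {"people": 4, "budget": 3},
    "environment":           {"regulation": 4},
    "management_practices":  {"ITIL": 3, "DevOps": 3},
    "emerging_technologies": {"cloud": 4, "IoT": 2},
}

# Gap analysis: list every component where the desired state exceeds the
# current one, i.e., where the red line sits outside the green line.
for element, components in desired.items():
    for name, target in components.items():
        gap = target - current[element].get(name, 0)
        if gap > 0:
            print(f"{element}/{name}: gap of {gap} to close")
```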

One of the great things about the Mesh is that it provides a clear visual presentation of what an organization should consider in the journey of service management. The folks at VeriSM have developed a tool that organizations can use to understand or build their own mesh.

Challenges with the Management Mesh

Well, no approach is perfect. The VeriSM model is actually pointing that out by saying that one needs to consider the different tools in their box and figure out two things:

  1. What works for their context
  2. What is missing that needs to be brought in

This is invaluable advice, as an organization can introspect from strategic and tactical perspectives, and identify an improvement roadmap to meet customer outcomes and deliver value.

However, challenges arise if the organization doesn’t understand those elements, or doesn’t have the capacity to acquire and adapt the required elements to fill the identified gaps.

Should the organization find itself in this position, the mesh simply becomes a good piece of artwork to hang on the office walls. For example, if the context requires SIAM as a management practice, or IoT as an emerging technology, then a lack of knowledge or financial resources, or of the right culture to change ways of working, will invariably become a hurdle to the organization getting where it needs to be.

In order to get to where the VeriSM management mesh points, organizations must therefore be prepared to invest in acquiring the necessary knowledge and skills, secure the required financial resources, and foster a culture that embraces new ways of working.

Another challenge could be complexity, especially where the number of components in one element far exceeds the others. A square-looking management mesh is neat, but might not be realistic: an organization might end up with something akin to a strange trapezoid, which is less visually helpful when it comes to identifying gaps.

Finally, the organization might decide to maintain multiple meshes for different service offerings, or evolve rapidly across the four elements, leading to challenges in keeping the management meshes up to date and available to all stakeholders.

Related reading

Major Network Outages of 2021

Perhaps the biggest effect of the digital age is that connectivity is a basic need for all. It is little wonder that enterprise risk management considers network outages a top-tier risk.

  • 83% of respondents to an Opengear survey reported that network resilience was their number one concern
  • 92% reported financial loss from network outages.

Despite significant efforts to limit the impact of network unavailability through redundancy and other means, providing 100% uptime still remains a major challenge, mainly due to unforeseen factors.

So, let’s take a look at some recent outages that have significantly impacted users around the globe.

(Understand the impact of redundancy on availability.)

Major Network Outages

Facebook’s social gaffes

On Monday, October 4th, 2021, some form of online world peace was experienced for approximately six hours following a network outage that took down Facebook together with its associated services, WhatsApp and Instagram.

In a detailed post, their infrastructure VP explained how a configuration change took down all the connections in their backbone network, effectively disconnecting their data centers from the rest of the internet. This caused a second problem: their DNS servers withdrew the BGP advertisements when they could not reach the data centers, causing all DNS queries to their services to go unanswered.

Unfortunately, due to security measures which depended on the network to work, data center engineers faced challenges while attempting to physically access the backbone network routers to reconfigure them manually.

Summarizing this outage, Cloudflare called the episode:

“A gentle reminder that the internet is a very complex and interdependent system of millions of systems and protocols working together.”

This was the second major outage affecting the social media giant in 2021, with the first occurring on 19th March for 45 minutes, affecting the same services. A Facebook spokesperson later said that the outage was due to a technical issue that had since been resolved.

Fastly goes slow after network bug

On June 8th, 2021, Fastly had an outage that lasted almost an hour, causing major websites such as Amazon, eBay, Reddit, Spotify, Twitch, The Guardian, The New York Times, and even the UK government’s websites to be unreachable.

The company is one of the world’s leading Content Delivery Networks, and as a CDN it runs an edge cloud network which brings web content closer to users, thereby reducing latency, while also facilitating handling of traffic spikes and offering protection from DDoS attacks.

Fastly explained that a software deployment the previous month had introduced a latent bug into their network. This bug was then triggered by a configuration change pushed by a customer, resulting in their network returning errors on 85% of routing requests. Users reported getting 503 errors, meaning there was a temporary problem accessing the web hosting servers.

The team at Fastly was quick to isolate the cause and disable the configuration, before turning their attention to deploying a bug fix and carrying out a postmortem on preventive and corrective measures to avoid recurrence.

Cloudflare & Akamai’s bottlenecks

In recent times, both Cloudflare and Akamai experienced network outages, resulting in service unavailability for many of their customers’ end users.

Cloudflare, which handles approximately 18% of all web traffic, experienced a network outage that impacted 50% of its traffic, resulting in major websites being unreachable for around 27 minutes. Websites impacted included Shopify, Discord, and AWS.

The incident on 17th July 2020 resulted from a configuration change made on their backbone network to alleviate congestion. Unfortunately, an error routed all the backbone traffic to a single router in Atlanta, which became overwhelmed, resulting in congestion and subsequent errors. To resolve the issue, the Atlanta router was dropped from the network and traffic rerouted to other routers.

On the other hand, Akamai's edge DNS had an issue that impacted quite a number of websites globally on 22nd July 2021, for about an hour. Given that the company boasts that 85% of the world's Internet users are within a single “network hop” of an Akamai CDN server, the downtime was felt significantly across the world.

Services affected included PlayStation Network, Airbnb, FedEx, and UPS. In a series of tweets, Akamai reported that a software configuration update triggered a bug in the DNS system, resulting in the incident.

Rolling back the update addressed the issue, but the damage had already been done.

Freak case: South Africa’s slow internet

In January 2020, a freak occurrence of two undersea internet cables suffering breakdowns at the same time resulted in slow internet speeds for South Africa and nearby countries.

The South Atlantic 3/West Africa (SAT-3/Wasc) submarine cable which links Portugal and Spain to South Africa, and the West Africa Cable System (Wacs) which links SA with the UK, both suffered breakdowns near Gabon and Congo respectively. A second cut on the Wacs cable near the UK was later discovered, compounding the problem, according to reports.

Traffic was rerouted to other undersea cables while repair ships were marshalled to restore connectivity. Unfortunately, delays arose from the time needed to prepare for such an operation, as well as from high winds in the Atlantic Ocean.

It took several weeks for services to be completely restored on the two cables.

A future with no outages?

The internet powers today’s economy. Customers want faster access to the data they need, whether for business or personal use.

The rise of edge computing to supplement the cloud cannot be ignored, as it addresses the need for low latency and high resilience through CDNs and other technologies. However, the difficulty of implementing redundancy across highly distributed infrastructure, and of providing onsite support in case of emergencies, will most likely increase the risk of outages.

Configuration changes are also evident as a source of major network outages, as several of the examples above demonstrate. As more complexity is introduced, the chances of latent bugs slipping past existing test scenarios are likely to increase.

So, what’s the key takeaway? The focus for service providers will be to:

  • Build more layers of resilience through redundancy
  • Limit impact through distributed networks and faster restoration mechanisms

IT Agility Explained: Achieving Agility Across the Enterprise https://www.bmc.com/blogs/it-agility/ Fri, 01 Oct 2021

The demands for businesses and IT to be quicker in responding to the ever-evolving customer and operating environment do not seem to be slowing anytime soon. Among the leading forces of change in the tech space, McKinsey lists:

  • Digitization
  • Globalization
  • Automation
  • Analytics

And then COVID-19 joined the game, massively disrupting operations while accelerating the digitization journey for almost all organizations. This forced laggards who had never considered remote working, cloud, apps, and social media to get on the bandwagon immediately or risk oblivion as the unexpected pandemic risk materialized.

In the ever-evolving world of information technology, being able to speedily, yet effectively, respond to market changes is a difficult but critical task.

As organizational capabilities change, so must IT capabilities, and sometimes it becomes necessary to reconfigure or completely replace organizational structures, processes, or systems in response to evolving marketplace realities. At the same time, there is still a desire for control and stability, hence the lingering hesitation to embrace change in IT environments.

An increasingly common suggestion for how businesses can achieve all these things effectively is through IT agility.

What is IT agility?

In general, agility is a common business term that refers to how fast an organization responds to opportunities. It is typically measured as the time between an organization becoming aware of a potential business opportunity and acting on it.

The ITIL® 4 Foundation publication defines organizational agility as the ability of an organization to move and adapt quickly, flexibly, and decisively to support internal changes. This could include:

  • Strategy, practices, or technology requiring different skills or organizational structure
  • Changes to relationships with partners and suppliers

IT agility, then, is a measurement of how efficiently the IT infrastructure of an organization can respond to external stimuli.

This can mean how effectively it embraces the pressure to change or how successfully it creates a new opportunity. Instead of being thought of as another task to complete, IT agility should be viewed as more of an overall mindset, eventually becoming part of the company culture.

While there are many approaches to IT agility, the Agile Manifesto has become the go-to reference in the world of software development, embracing frequent delivery and welcoming changing requirements. As opposed to "waterfall" methods, an iterative, incremental delivery approach is applied through bi-weekly or monthly sprints. At the end of each sprint, the work and project priorities are evaluated, which allows client feedback to be incorporated, along with improvements and changes.

(See how agile & service management work together.)

Agile and Waterfall

Principles of IT agility

A variety of common principles for IT agility can be gleaned from the Agile Manifesto, including:

  1. Satisfy the customer through early and continuous delivery
  2. Deliver updates frequently through bi-weekly or monthly sprints
  3. Cultivate an environment of changing requirements
  4. Pay special attention to technical excellence and good design
  5. Promote strong communication between business people and developers
  6. Keep it simple
  7. Encourage continuous reflection on progress as well as what improvements can be made

But one must be careful not to implement IT agility in a manner that is at the expense of business outcomes and value. A fragmented approach can create other bottlenecks in the flow of work resulting in frustration and misalignment, cost overruns, high technical debt, and ultimately unhappy customers.

In addition, speed can sometimes come at the expense of completeness and quality. A holistic approach is more prudent, one that taps into organizational strategy with customer experience at its heart.

The Harvard Business Review gives six principles for building a company’s strategic agility that could also be considered as the premise for IT agility:

  • Prioritize speed over perfection
  • Prioritize flexibility over planning
  • Prioritize diversification and “efficient slack” over optimization
  • Prioritize empowerment over hierarchy
  • Prioritize learning over blaming
  • Prioritize resource modularity and mobility over resource lock-in

How can you achieve IT agility?

It is important to remember that IT agility is not a quick project that can be executed over a long weekend; IT agility requires an entire shift in the company’s ethos and thinking.

(Learn about Lewin’s three steps to change.)

Once everyone is on board with this change, an evolving plan should be put in place to map out short-term and long-term strategic goals. A solid outlook on where you would like your systems to go makes it that much easier to select the appropriate opportunities when they come around.

To begin creating this plan, the business must first reflect on some of the key factors that are driving the application of agility to begin with:

  • Are systems tightly coupled and opposed to change?
  • Are deployment schedules constrained due to testing complexities and integration dependencies?

Finding out what is driving the change helps make the goals specific and relevant.

Sure, becoming more agile might seem to demand an extensive plan and a complete overhaul of the system. But keeping it all as simple as possible is an important mantra to remember.

For the most part, re-purposing well-designed components and systems will be a quick and consistent way to cover the various problem spaces. However, rigidity in IT infrastructure will always be a stumbling block: the digital age cannot wait for traditional approaches to acquiring and provisioning the underlying components that support software.

Migrating to the cloud, infrastructure automation, CI/CD technologies, and SRE practices can help deliver the flexibility and self-service capability needed by developers to get the infrastructure they require on demand.
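
As a rough illustration of that self-service idea (all names here are hypothetical, and no specific cloud API is implied), a minimal catalog-driven request flow in Python might look like this:

# Hypothetical offering catalog; a real one would live in a service portal.
CATALOG = {
    "small-web": {"cpus": 2, "memory_gb": 4},
    "ci-runner": {"cpus": 4, "memory_gb": 8},
}

def request_environment(offering: str, owner: str) -> dict:
    """Validate a request against the catalog and resolve its specification.

    In a real pipeline this step would trigger infrastructure automation
    (e.g., a CI/CD job applying infrastructure-as-code); here it simply
    returns the resolved spec.
    """
    if offering not in CATALOG:
        raise ValueError(f"Unknown offering: {offering!r}")
    return {"owner": owner, **CATALOG[offering]}

print(request_environment("small-web", owner="dev-team-a"))

The point of the sketch is that developers pick from pre-approved, standardized offerings instead of raising tickets, which is what makes on-demand provisioning safe as well as fast.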

People are at the heart of any transformation. IT agility is no exception.

Hierarchical structures with their ‘command and control’ can end up limiting speed when it comes to decision making and allocation of resources. The same can be said of traditional change advisory boards (CABs) that stand in the way of quick decision making.

Transforming to matrix structures that are adept at quick allocation/reallocation of resources to priority needs is preferable. In addition, permanent, simple multi-competent teams that are assigned to work exclusively on a product can provide the autonomy needed to both:

  • Make faster decisions
  • Deliver features and solutions quicker

In this holistic approach, other elements throughout the value creation and preservation journey must be considered. IT agility requires agility in budgeting, contracting, procurement, and any other practice that is involved in IT activities.

This requires not just agile processes; agile mindsets matter even more.

Agility is a mindset, not a switch

IT agility is about far more than just adopting new strategic plans and development practices. It requires an entire rethinking of the IT organization to successfully meet the intended goals and move one step closer to complete enterprise digital transformation.

Starting efforts towards an agile IT is not easy, but once the process gets underway, many organizations see improvements remarkably fast. Early changes can free up resources, which allows IT to better support digital transformation and further development.

There is a direct correlation between having an agile IT and reducing the Time to Value (the time between an initial request and its delivery) for a business. IT agility is no longer the wave of the future but an immediate imperative: are you ready?

Change Management Explained: Change in Service Management, DevOps & More https://www.bmc.com/blogs/types-levels-change-management/ Tue, 14 Sep 2021

Translating customer requirements into actual products and services is one of the main value streams of service management. And the activities behind the building, testing and deployment of these products and services are usually enabled and controlled by the change management practice.

According to VeriSM, change management is normally implemented as a process that:

  1. Reviews and approves (or rejects) a proposed change
  2. Manages it through its development and deployment

At the heart of controlling changes is risk management, to protect the service provider and its customers from unnecessary negative impact of changes, including service degradation and regulatory penalties.

Successful service management requires proactively managing the system changes associated with configurations, resource provisioning, and service operations. These changes are not always planned, often emerging as unforeseen consequences of a service disruption. In a complex IT infrastructure environment, technology dependencies can cause the impact of a small change to escalate across the IT environment and affect multiple users at scale.

Organizations therefore need to adopt prudent change management strategies, solutions, and practices that help manage service and component changes, and mitigate the associated risks. Let’s take a look at change management activities, tools, and best practice approaches.

What is change?

The ITIL® 4 change enablement practice defines change as:

“the addition, modification or removal of anything that could have a direct or indirect effect on services.”

Though service management frameworks vary in best-practice workflows, most follow a standard hierarchy of change:

Change Types

Standard change

A low-risk change that is pre-authorized and follows documented tasks per a change model, which outlines a repeatable workflow to manage such changes.

An example is an IT service request fulfilled from the service desk platform.

Normal change

An intermediate- or high-risk change that cannot be categorized as an emergency or handled through a pre-approved change process. A thorough review process is required before approving such changes.

Emergency change

Urgent changes that may present high-risk consequences if not addressed promptly. An emergency change may be needed to avoid a major incident or to resume normal operations after an incident occurs.

Examples include:

  • An upgrade to address an active information security threat
  • Failover to an alternate data center due to power outages
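
A minimal sketch of how these three change types might drive approval routing in code (the type names follow the hierarchy above; the routing rules are illustrative assumptions, not ITIL prescriptions):

from enum import Enum

class ChangeType(Enum):
    STANDARD = "standard"    # low risk, pre-authorized
    NORMAL = "normal"        # requires assessment and approval
    EMERGENCY = "emergency"  # expedited handling

def approval_route(change_type: ChangeType) -> str:
    """Map a change type to an (illustrative) approval path."""
    routes = {
        ChangeType.STANDARD: "auto-approved per the change model",
        ChangeType.NORMAL: "peer review, then change authority approval",
        ChangeType.EMERGENCY: "expedited approval by the emergency change authority",
    }
    return routes[change_type]

print(approval_route(ChangeType.EMERGENCY))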

Change management activities

The main activities involved in managing changes include the following (a brief code sketch follows the list):

  1. Recording changes. Identifying the description of the change, who and what is involved, and potentially what is impacted. These records might be maintained in an ITSM solution or a Kanban board.
  2. Planning changes. Considering resources—time and business needs—to determine the best way and scheduling for successful implementation. Plans would also include testing activities, as well as measures to be taken should the change be unsuccessful including rolling back to previous versions.
  3. Approving changes. Using appropriate mechanisms to approve the change having considered the plans and risk mitigations. Approval mechanisms include automated workflows, peer reviews, or designated authorities based on the change level and preferred approach.
  4. Communicating changes. Planned and approved change schedules would need to be communicated to relevant stakeholders to ensure alignment and preparedness for all eventualities.
  5. Reviewing changes. After execution, conduct a post-implementation review to consider what went well, what went wrong, and what opportunities for improvement exist.
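
Here is the promised sketch: a toy change record whose lifecycle mirrors the five activities above (the field names and states are illustrative, not taken from any particular ITSM tool):

from dataclasses import dataclass, field

# Illustrative lifecycle states mirroring the five activities above.
STATES = ["recorded", "planned", "approved", "communicated", "reviewed"]

@dataclass
class ChangeRecord:
    description: str
    requester: str
    state: str = "recorded"
    history: list = field(default_factory=list)

    def advance(self) -> None:
        """Move the record to the next lifecycle state, in order."""
        idx = STATES.index(self.state)
        if idx == len(STATES) - 1:
            raise ValueError("Change has already been reviewed")
        self.history.append(self.state)
        self.state = STATES[idx + 1]

change = ChangeRecord("Upgrade TLS certificates", requester="ops")
change.advance()  # recorded -> planned
print(change.state)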

Change management approaches

The approach to handling changes differs from one organization to another, and also depends on the type of change.

Software-based changes can be automated end to end by taking advantage of CI/CD technologies to execute changes frequently and quickly. For example, Etsy is known to perform 50 deployments per day through a fully automated continuous delivery pipeline.

Physical infrastructure changes such as equipment installation may be slower, requiring a staged approach.

When it comes to organizational approaches, here are some common ones:

  • Some organizations choose to freeze changes during certain peak periods to reduce the risks of service outages and poor customer experience.
  • Other organizations have a stringent governance approach to change approval, needing a select group of people (like a Change Advisory Board—CAB for short) to hold scheduled meetings to discuss and approve all major changes.
  • Others decentralize change authority depending on service ownership and risk levels.

What matters most in all these approaches is delivery of value for the organization, be it risk mitigation, agility, or reliability.

Prioritizing change

The approval process necessary for a change can depend on decision criteria that are unique to every organization. These criteria are based on priority, resources, cost, business need, and other factors.

A common approach to prioritize change requests is by considering the impact and urgency.

Impact

Impact evaluates the business impact of a proposed change request. It also accounts for potentially damaging consequences of an unsuccessful change execution that were not previously considered.

The ranking may range from minor impact to extensive impact.

Urgency

Urgency evaluates how quickly a change needs to be implemented to realize its impact. A change request that requires quick implementation, or one that must be initiated early to account for a prolonged implementation duration, is ranked with high urgency.

The ranking may range from Low Urgency to High Urgency.

Priority

Priority indicates the relative importance of a change request and is determined by correlating the impact and urgency.

Here is an example of a priority matrix:

Basic impact, urgency and priority matrix

(Deep dive into the impact urgency priority matrix & this BMC documentation.)
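
Because the exact matrix values vary by organization, here is a minimal sketch assuming a typical 3×3 mapping (the labels are illustrative, not prescribed by ITIL):

# Illustrative 3x3 priority matrix: keys are (impact, urgency) pairs.
PRIORITY_MATRIX = {
    ("extensive", "high"): "critical",
    ("extensive", "medium"): "high",
    ("extensive", "low"): "medium",
    ("moderate", "high"): "high",
    ("moderate", "medium"): "medium",
    ("moderate", "low"): "low",
    ("minor", "high"): "medium",
    ("minor", "medium"): "low",
    ("minor", "low"): "low",
}

def priority(impact: str, urgency: str) -> str:
    """Determine priority by correlating impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]

print(priority("moderate", "high"))  # -> "high"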

DevOps & change management

DevOps is a philosophy and a movement focusing on organization-wide collaboration to support the delivery of value to the organization and its customers. And at the heart of this collaboration is bringing together developers—who introduce change—and operations, who have to manage the effects of change.

A change management approach that does not have collaboration as one of its core values invariably leads to conflicts between those who desire change and those who favor stability or are risk averse. This has played out in many organizations where aspects of the change management process (such as CAB meetings) end up becoming a bottleneck rather than a facilitator of change.

Because the digital age has necessitated faster delivery of new features, many organizations now have digital transformation at the heart of their strategy. And for tech functions, this means adopting technologies such as cloud and CI/CD, and approaches such as Agile and DevOps to meet these business needs.

The change management approach therefore has to evolve to suit this modern way of working. The change types then become a risk-based foundation for the organization's change models.

Most standard changes can therefore go the automation route and require very little intervention from upper management. If the change integrates cleanly and all tests pass, it is pushed into production (see the sketch below).
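
As a sketch only (the checks are assumptions, not a prescribed pipeline), such an automated gate might look like this:

def can_auto_deploy(change_type: str, integration_passed: bool,
                    tests_passed: bool) -> bool:
    """Only pre-authorized standard changes with a green integration
    build and passing tests skip manual approval."""
    return change_type == "standard" and integration_passed and tests_passed

if can_auto_deploy("standard", integration_passed=True, tests_passed=True):
    print("Pushing change to production...")
else:
    print("Routing change to the change authority for review.")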

Only high-risk changes that can severely impact the organization are subjected to appropriate levels of scrutiny during the approval process.

By the time these types of changes are reviewed by upper management, all the technical teams have already participated collaboratively in the planning process and are fully aligned in readiness for the change. This shortens the time required to approve the change.

Success in change management

Whatever the approach, there are four success factors that ITIL 4 suggests any organization practicing change management should aim for:

  1. Ensuring that changes are realized in a timely and effective manner.
  2. Minimizing the negative impacts of changes.
  3. Ensuring stakeholder satisfaction.
  4. Meeting change-related governance and compliance requirements.

The rapid evolution of customer needs and technology environments means that no one-size-fits-all approach can satisfy the needs of effective change management.

Organizations must consider appropriate governance frameworks that control the risks involved during changes, but at the same time do not introduce unnecessary bureaucracy that can prevent or delay value creation through changes.

Database Administrator (DBA) Roles & Responsibilities in The Big Data Age https://www.bmc.com/blogs/dba-database-administrator/ Mon, 16 Aug 2021

Back in 2017, when The Economist famously declared "Data is the new oil!", it was simply stating the obvious: today's most valuable companies are the ones that make the most of the data in their possession, whether willingly given or not.

Data is the lifeblood of any organization, and the management of data in IT systems remains a critical exercise, particularly in a time where data privacy regulation is a hot topic.

In this context, the role of the Database Administrator (DBA) has evolved over time, given the evolution of data sources, types, and storage options. Let's review the current status and see what the future holds for DBAs.

Database Administrator

What is a DBA?

Short for database administrator, a DBA designs, implements, administers, and monitors data management systems, ensuring their design consistency, quality, and security.

According to SFIA 8, database administration involves installing, configuring, monitoring, maintaining, and improving the performance of databases and data stores. While the design of databases would be part of solution architecture, the implementation and maintenance of development and production database environments would be the work of the DBA.

(Read our data architecture explainer.)

What does a DBA do?

The day-to-day activities that a DBA performs, as outlined in ITIL® Service Operation, include the following (a brief code sketch follows the list):

  • Creating and maintaining database standards and policies
  • Supporting database design, creation, and testing activities
  • Managing the database availability and performance, including incident and problem management
  • Administering database objects to achieve optimum utilization
  • Defining and implementing event triggers that will alert on potential database performance or integrity issues
  • Performing database housekeeping, such as tuning, indexing, etc.
  • Monitoring usage, transaction volumes, response times, concurrency levels, etc.
  • Identifying, reporting, and managing database security issues, audit trails, and forensics
  • Designing database backup, archiving, and storage strategy
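
As promised above, here is a toy illustration of a few of these housekeeping and monitoring items, using Python's built-in sqlite3 module as a stand-in for a production RDBMS (the table and index names are made up):

import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a production database
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")

# Housekeeping: add an index to speed up lookups by customer.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# Refresh optimizer statistics (SQLite's equivalent of stats maintenance).
conn.execute("ANALYZE")

# Monitoring: list the indexes defined on the table.
for row in conn.execute("PRAGMA index_list('orders')"):
    print(row)

conn.close()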


What competencies does a DBA require?

At a bare minimum, the DBA will:

  • Have an IT, computer science, or engineering educational background
  • Need to be conversant with structured query language (SQL) and relevant database technologies (whether proprietary or open source)
  • Understand coding and service management (to some degree)

Relevant database technologies include SQL Server, MySQL, Oracle, IBM Db2, and MongoDB, among others. Now, this doesn’t mean you have to be certified in all of them, but a working knowledge of a few of them is required.

The European e-Competence Framework (e-CF) outlines five associated competencies that the DBA should have, all at proficiency level 3 (on a scale of 1 to 5):

  • Build – Application Development: Acts creatively to develop applications and to select appropriate technical options. Accounts for others' development activities. Optimizes application development, maintenance, and performance by employing design patterns and by reusing proven solutions.
  • Build – Component Integration: Accounts for own and others' actions in the integration process. Complies with appropriate standards and change control procedures to maintain the integrity of the overall system functionality and reliability.
  • Run – Change Support: Ensures the integrity of the system by controlling the application of functional updates, software or hardware additions, and maintenance activities. Complies with budget requirements.
  • Run – Information and Knowledge Management: Analyses business processes and associated information requirements and provides the most appropriate information structure.
  • Manage – Information Security Management: Evaluates security management measures and indicators and decides whether they comply with the information security policy. Investigates and instigates remedial measures to address any security breaches.

A cursory search across popular talent recruiting websites indicates that additional soft skills needed by DBAs include:

  • Business awareness and understanding of business requirements of IT
  • Excellent problem-solving and analytical skills
  • Good communication, teamwork, and negotiation skills
  • Good organizational skills
  • Flexibility and adaptability
  • Excellent business relationship and user support skills

DBA career development

SFIA 8 defines four levels of responsibility for the DBA, which you can map to your career development roadmap:

Level 2 (Assist)

  • Assists in database support activities

Level 3 (Apply)

  • Performs standard database maintenance and administration tasks
  • Uses database management system software and tools to collect performance statistics

Level 4 (Enable)

  • Develops and configures tools to enable automation of database administration tasks
  • Monitors performance statistics and creates reports
  • Identifies and investigates complex problems and issues and recommends corrective actions
  • Performs routine configuration, installation, and reconfiguration of database and related products

Level 5 (Ensure, Advise)

  • Identifies, evaluates, and manages the adoption of database administration tools and processes, including automation
  • Develops and maintains procedures and documentation for databases. Contributes to the setting of standards for definition, security, and integrity of database objects and ensures conformance to these standards
  • Manages database configuration including installing and upgrading software and maintaining relevant documentation
  • Monitors database activity and resource usage. Optimizes database performance and plans for forecast resource needs


Outlook for DBAs

The DBA role is here to stay when it comes to data administration, but it is clear that the name might need some tweaking.

The digital age has resulted in the huge growth in unstructured data such as text, images, sensor information, audio, and videos, on account of e-commerce, IoT, AI and social media. As a result, the job title ‘database administrator’ seems to be giving way to ‘data administrator’, to cater for management of both structured (database) and unstructured (big data) data sets.

Structured vs. unstructured data

Since most digital organizations are no longer restricted to transactional data only, the modern-day DBA must be conversant with file, block, and object storage solutions.

And because of the sheer volume of data, as well as the ability to access AI/machine learning solutions to digest such data, the preferred data storage mode for most digital organizations is cloud based. Therefore, the modern DBA must become fully conversant with cloud architectures and technologies, including data lakes and big data solutions like Hadoop.

The rise of DevOps as the preferred model for end-to-end product management means that the DBA must become a comb-shaped specialist, working in an autonomous environment with platform engineers to develop automated self-service tools that software developers can utilize to create the data solutions they require for their applications.

This means the DBA will need to build software engineering capabilities as part of their repertoire.



DBAs must acknowledge data privacy

Data protection regulation has become a key focus area for enterprises around the world. The stringent requirements and hefty fines have resulted in scrutiny of data management becoming a critical corporate governance imperative.

The DBA must become conversant with data protection regulations such as the GDPR, and with how to implement the relevant security controls to ensure that user/customer privacy rights are respected in business operations.
