Taming the Mainframe Storage Monster
Contents
How do Mainframe Storage Issues Compare to Open Systems Issues?
Challenges of Mainframe Storage Management
MAINVIEW SRM Addresses these Challenges
SMS has been enormously successful, but there are limits…
MAINVIEW SRM Reduces Complexity
Introduction
With all the press coverage about the explosive growth of open systems, IT management may overlook the fact that mainframe-based DASD is still the backbone of most large enterprises. More and more, it has become obvious that the mainframe is a large, robust server whose role is to service mission-critical IT needs.
The qualities of mainframe systems that make them so valuable are industrial-strength processing capabilities and precise control systems that can control the rate of work from single host systems to huge parallel sysplex configurations. Indeed, it is the fact that these systems are so capable of managing themselves that led to the shrinking z/OS staff. Windows and UNIX environments still require a small army of administrators, but in the mainframe environment the work gets done, on time, day in and day out.
Yet, despite the lack of visibility, there is a crisis brewing in the mainframe space. Growth slowed during the recession, but it did not stop. Head count was trimmed, but deadlines remained the same or accelerated. Systems that worked with occasional oversight became “exception-only.” Storage Administrators characterize their work as “fire-fighting” because they are constantly being pulled in multiple directions to alleviate storage issues.
In other words, all the things occurring in the open systems world (such as rapid expansion) were taking place on a smaller scale in the mainframe world, albeit with a much smaller cadre of support staff.
How do Mainframe Storage Issues Compare to Open Systems Issues?
The issues affecting mainframe storage tend to be similar in many respects to their open-systems counterparts. Disk arrays are physically identical, with differences deeply buried in the microcode of the arrays. Storage Area Networks (SANs) are little more than the open-systems version of the ESCON and FICON that have been prevalent in mainframe systems for years.
The service requirements (performance, availability, recoverability, and problem management) of enterprise data are indistinguishable by platform. Even the desired frequency of these services is similar (for example, the notion of a system shutdown for backup is inconceivable on all but the smallest servers). Where there are differences, it is because of the operating system and the services it provides to the user. The overwhelming majority of mainframe systems are effectively “clustered servers” with multiple systems directly sharing data. Unlike their mainframe counterparts, Windows and UNIX-based servers are still mainly standalone servers.
Challenges of Mainframe Storage Management
Despite the fact that the storage hardware is virtually identical for both open systems and mainframes, there are several unique challenges confronting mainframe Storage Administrators. The leading issue is training. Most mainframe Storage Administrators receive their training on-the-job. Learning the proper management philosophy is sometimes a hit-or-miss proposition. Management techniques for performance, availability, recoverability, authorization, and devices are all significantly different from other environments. This paper points out areas of concern for Storage Administrators in mainframe IT installations and why it is so important to have a tool in place that can provide a “standard” process-oriented viewpoint into issues related to Storage Management.
Performance
Consider the broad mix of workloads that run on mainframe systems. Sure, there are monitoring tools, but where is the problem? How do you pin down the issue? DASD is one of the few mechanical devices left in the data center, but it doesn’t take much for DASD issues to affect system performance in a major way. Conversely, adding cache memory to avoid I/O wait can have a dramatic effect on performance. However, solving storage-related performance issues is rarely as simple as adding cache to an array. The symptoms are usually non-specific workload slowdowns, so just identifying the issue can be a struggle. Compounding the issue is the fact that mainframe DASD are all virtualized volumes mapped onto sophisticated arrays. Delving through the layers of abstraction to find and correct I/O hot spots can easily be a full-time occupation all by itself.
In most mainframe environments, the Storage Administrator spends a large amount of time working with the Database Administrator (DBA). DB2® and other database management systems use huge quantities of storage resources. Where general users might request DASD in hundreds of megabytes or a few gigabytes, DBAs demand space in quantities at least an order of magnitude greater (gigabytes or terabytes). As if that weren’t enough, many DBAs also insist on hand placing their databases to “ensure” good performance. (Humor them; it is all virtual DASD anyway.)
A Storage Administrator’s job frequently encompasses interactions with Performance Analysts. A rough estimate is that about 20% of a Storage Administrator’s time is spent on performance-related issues, and the vast majority of that time concerns transaction processing and databases (e.g., CICS, IMS, and DB2).
Availability
There are many different ways to describe availability. Not that many years ago, the discussion of availability centered on device availability. Today, with RAIDx technologies in most of the disk arrays, a device can fail within the array and access to data may not be affected. (These failures are characterized by a Service Technician showing up at the door with a replacement part after the array “phoned home.”)
Now, availability discussions are usually focused on two areas: access to data and space availability.
In terms of access to data, a data set might be created on tape in a manual tape library, in which case a human must find and mount the tape (a real challenge when the tape is located off-site). At other sites, a robotic tape library might be used, but in those cases, there are issues with the availability of cartridge slots (e.g., how “full” is the library?) or how busy the robot or the tape drives are. Virtual tape libraries can circumvent some of those issues by deferring the placement of data on real tape, but virtual tapes aren’t portable to off-site locations.
For space availability purposes, availability is considered as the space that can be used for new allocations within a storage group. Hierarchical Storage Management (HSM) tries to keep storage groups at desired occupancy levels, by migrating, deleting, or consolidating data sets based on their System Managed Storage (SMS) attributes. Sometimes storage groups can fill to the point that new allocations fail, and sometimes the space simply gets so fragmented that it is impossible to allocate the amount of space needed without exceeding the system’s allocation rules.
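The two failure modes described above, a storage group that is simply full and one whose free space is too fragmented to use, can be told apart with a simple check. The following is a minimal sketch only: the `Volume` class, the function names, the sample numbers, and the default extent limit are assumptions made for illustration and are not SMS, HSM, or MAINVIEW SRM interfaces.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Volume:
    """Hypothetical view of one volume in a storage group (sizes in tracks)."""
    name: str
    capacity: int
    free_extents: List[int] = field(default_factory=list)  # sizes of free areas


def occupancy(volumes):
    """Fraction of the storage group's total capacity that is in use."""
    total = sum(v.capacity for v in volumes)
    free = sum(sum(v.free_extents) for v in volumes)
    return (total - free) / total


def can_allocate(volume, request, max_extents):
    """True if `request` tracks fit in at most `max_extents` free areas.

    The largest free areas are tried first, which minimizes the number
    of extents needed to satisfy the request.
    """
    remaining = request
    for size in sorted(volume.free_extents, reverse=True)[:max_extents]:
        remaining -= size
        if remaining <= 0:
            return True
    return False


def diagnose(volumes, request, max_extents=5):
    """Classify an allocation attempt as "ok", "fragmented", or "full".

    max_extents is an assumed allocation rule, chosen for illustration.
    """
    if any(can_allocate(v, request, max_extents) for v in volumes):
        return "ok"
    if any(sum(v.free_extents) >= request for v in volumes):
        return "fragmented"  # enough free space, but in too many pieces
    return "full"
```

Separating the "fragmented" case from the "full" case matters because the remedies differ: the first calls for consolidating free space, the second for migrating data or adding volumes.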
Yet another aspect of availability is mirroring. Remote mirroring is typically used for business continuance, and local mirroring is typically used for data replication or “snapshot” copies. Mainframe systems almost exclusively choose hardware-based functions built into arrays for these types of availability services. This is related to hardware management because these functions are invoked through operating system commands that are sent to the array for interpretation and execution.
Hardware Management
Some Storage Administrators do nothing other than install and rotate DASD through the data center. When new arrays arrive, they need to be initialized and have data moved onto them. Then the old arrays are cleaned off and re-purposed. Most data centers institute a “hand-me-down” program so that new, high-performance DASD go to the most important applications, and then that application’s DASD is transferred to the next-most important application, and so on. The process of adding new DASD can easily become a very time-consuming proposition for the Storage Administrator.
Installing new microcode or firmware can also be a difficult task for the Storage Administrator. Installations that are running hardware near the edge of its capacity have found that the safest method of deploying new microcode is to ensure that it does not create new errors in test environments before moving into production environments. This exercise may require a series of regression tests or studying the documentation to ensure that applicable tests get run. Few problems are harder for Storage Administrators to troubleshoot than microcode errors.
As previously stated, remotely mirrored DASD is usually managed through hardware-based functions, such as the EMC Symmetrix Remote Data Facility (SRDF), or the IBM Peer-to-Peer Remote Copy (PPRC) or Extended Remote Copy (XRC). It is important to ensure that these functions are always operating as designed to avoid dangerous “single-points-of-failure” within the enterprise. In most cases, it is the responsibility of the Storage Administrator to set up and manage these functions in the two or more data centers where the DASD arrays are located.
Recoverability
With the need to recover from device failures in decline, there is still the issue of applications that create “bad data.” Application and data set recovery is never fun—it is always a “pressure” situation, and sadly, a situation that can cause small problems to snowball into major issues. In many installations, the Storage Administrator is responsible for recovering data. The usual exceptions are databases and (sometimes) the operating system itself. Databases have very specialized tools for backup, but the majority of the data sets in the installation are handled by HSM, Data Set Services (DSS), or another vendor product. What sometimes gets lost is the relationship between data sets that has to be preserved. In most cases, recovering one data set leads to having to recover another and then another.
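The cascading effect described above, where recovering one data set forces the recovery of another and then another, amounts to following relationships until no new data sets turn up. The sketch below illustrates the idea with a breadth-first walk. The data set names and the `related` mapping are hypothetical; in practice, that relationship information would have to come from application knowledge or a tool that tracks it.

```python
from collections import deque


def recovery_set(start, related):
    """Return every data set that must be recovered together with `start`.

    `related` maps a data set name to the names that must stay consistent
    with it. The walk is breadth-first, so the result is the transitive
    closure of that relationship, returned in sorted order.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in related.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return sorted(seen)


# Hypothetical data set names, for illustration only.
related = {
    "PAYROLL.MASTER": ["PAYROLL.INDEX", "PAYROLL.HISTORY"],
    "PAYROLL.HISTORY": ["GL.POSTINGS"],
    "GL.POSTINGS": [],
    "HR.EMPLOYEE": ["PAYROLL.MASTER"],
}
```

Here a request to recover `PAYROLL.MASTER` pulls in its index and history data sets, and the history data set in turn pulls in `GL.POSTINGS`, even though the original request named only one data set.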
Authorization
Storage Administrators are responsible for the corporate crown jewels. Yet in many instances, the practice of securing authorization to these resources is left to staff or individuals unfamiliar with the specific data protection mechanisms built into the system. In the case of storage management, certain authorization profiles need to be in place to prevent the system or users from inadvertently causing damage. For example, SMS introduced the requirement that every managed data set must have a catalog entry; however, even today, it is not widely known that the security profiles needed to enforce this requirement are not created by default.
In large multi-system environments, Global Resource Serialization (GRS) or its equivalent is not a “nice-to-have”; it is required to prevent systems from damaging one another. Components like SMS and HSM depend on multi-system serialization functions to keep critical system structures safe. The most frequent reason for HSM’s control data sets getting corrupted is that serialization specifications are set incorrectly.
The same philosophy is true for catalogs: they must be locked to ensure that users don’t inadvertently damage these critical system structures. Even where tape data sets are concerned, the Storage Administrator gets involved because today’s environments depend on the integrity of catalogs.
Problem Management
There is never a good time for a problem, and problems never seem to come one at a time. Even so, a large portion of a Storage Administrator’s time is spent in problem management. Given that the Storage Administrator is responsible for the corporate data, to shirk this area would only lead to bigger problems down the road. The types of problems are always varied, but generally fall into one or more of the major areas we’ve already discussed. If any one particular problem eats into a Storage Administrator’s time more than another, it is probably still the “space jockey” problem of adding DASD, moving files, or finding data sets that users have inexplicably lost.
MAINVIEW SRM Addresses these Challenges
Training and a thorough knowledge of what to do under what circumstances are the biggest issues for the mainframe Storage Administrator. MAINVIEW® SRM is enormously helpful at solving this issue. The reporting and automation capabilities of MAINVIEW SRM monitor the storage configuration for warning signs of impending problems in SMS, HSM, Tivoli Storage Manager (TSM), storage devices, and other resources. Finding the problem at its early stages is not enough; MAINVIEW SRM also consolidates disparate information into one easy-to-use application, eliminating the problem of finding and selecting a tool from the large number of tools most installations have on hand. Indeed, in many installations, just finding the appropriate tool can be a major headache.
Once the issue has been found and a solution developed, MAINVIEW SRM provides exhaustive testing and validation facilities to ensure that the solution works as intended. Odd as it might sound, the most rigorous and comprehensive testing of proposed SMS policy changes is generally done by the seasoned expert Storage Administrators because they have the greatest level of familiarity with the unforgiving nature of storage management.
Regardless of whether the familiar ISPF interface or the Windows-based MAINVIEW Explorer is chosen, MAINVIEW SRM makes it simple to navigate and isolate issues for resolution. For example, most mainframe Storage Administrators have a number of recurring “problem child” applications or data sets that seem to require attention (because of performance, availability, or other issues). MAINVIEW SRM allows the Storage Administrator to group these watched data sets so that they can be monitored and given the special attention they need. Most important of all, these data sets can be monitored across the entire sysplex from a single screen to ensure that critical events are not missed. The sysplex-wide scope of MAINVIEW SRM is very important. A sysplex is a complex clustered set of systems. What takes place on one system in the cluster can easily affect the whole sysplex.
As the figure below shows, Storage Administrators must be intimately familiar with serialization technology (typically GRS), the Sysplex and coupling facility structures, and SMS to understand and interpret just about any issue that arises within their area. Storage management is multi-disciplinary, and it requires frequent technical interactions with a broad range of IT personnel.
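[Figure: the Storage Administrator’s areas of expertise, including serialization (GRS), the sysplex and coupling facility structures, and SMS]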
The advanced MAINVIEW system architecture operates with intelligent storage devices (including EMC’s Symmetrix, IBM’s Shark arrays, and Hitachi’s Lightning and Thunder arrays) to drill down into complex array functions. Most of the high performance disk arrays have their own user interfaces. What makes MAINVIEW SRM so much more useful than the standard vendor-provided tools is the ability to drill down from user data sets, through emulated mainframe volumes, to vendor disk subsystems to diagnose and correct performance and availability issues. The “stock” vendor-provided tools typically display information only from an array-centric view and miss out on important operating system information.
Space availability (or rather, the lack of space) is still one of the major burdens for Storage Administrators to resolve. The monitoring, reporting, and automation capabilities of MAINVIEW SRM can easily prevent runaway storage allocations, and monitor pools and storage groups for exception conditions. One of the key parts of BMC Software’s storage solution is the enhanced flexibility of allocation with MAINVIEW SRM. The Storage Administrator has much greater control over how much, where, and how space is used, leading to a much more efficient storage environment.
SMS has been enormously successful, but there are limits…
IBM’s System-Managed Storage (SMS) was introduced more than a dozen years ago, but it still represents a major differentiating factor between mainframe and open systems storage. What has made SMS so successful is that it has altered the storage equation. Previously, when there was a shortage of disk space on a user volume, the job would fail, then the help desk or production control would call the Storage Administrator to free sufficient space to run the job. SMS, working with HSM, streamlined that process and moved inactive data out of the way before it caused a problem.
Today’s issues arise from the fact that in a large proportion of installations, few people know what HSM is doing or whether it is making good decisions about what to move or where to move it within the storage configuration. All too often, data are thrashing, in constant motion between SMS DASD volumes and HSM migration tapes. Imagine a juggler, forced to keep more and more objects in the air. Not only does the risk of something breaking increase, but the juggler tires quickly as more and more items are introduced. That describes SMS in many of today’s data centers.
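One way to make this kind of thrashing visible is to count how often a data set is recalled shortly after it was migrated. The sketch below assumes a hypothetical activity log reduced to `(timestamp, data_set, action)` tuples; the tuple layout, the seven-day window, and the two-cycle threshold are illustrative assumptions rather than HSM record formats or recommended settings.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_thrashing(events, window=timedelta(days=7), min_cycles=2):
    """Return data sets that are repeatedly recalled soon after migration.

    `events` is an iterable of (timestamp, data_set, action) tuples, where
    action is "MIGRATE" or "RECALL". A cycle is a RECALL that follows a
    MIGRATE of the same data set within `window`.
    """
    last_migrate = {}
    cycles = defaultdict(int)
    for when, name, action in sorted(events):
        if action == "MIGRATE":
            last_migrate[name] = when
        elif action == "RECALL" and name in last_migrate:
            # Each migration is matched with at most one recall.
            if when - last_migrate.pop(name) <= window:
                cycles[name] += 1
    return {name: n for name, n in cycles.items() if n >= min_cycles}
```

A data set that appears in the result is a candidate for a different management class or a longer residency on primary DASD, since every round trip consumes migration and recall resources without freeing space for long.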
The Storage Administrator needs to monitor the effectiveness of SMS and HSM, as well as ensure that performance, recovery, availability, and other data center objectives are met. Accomplishing this task requires a tool such as MAINVIEW SRM. MAINVIEW SRM interprets SMS and HSM activities and presents them in easily understood reports that show the effectiveness of storage operations.
MAINVIEW SRM Reduces Complexity
Most of today’s large OS/390 and z/OS configurations are sysplexes because these large, loosely coupled, multisystem configurations deliver huge amounts of throughput. MAINVIEW SRM allows deep analysis of sysplex-wide storage issues without having to bounce around telnet sessions to individual systems in the sysplex. MAINVIEW SRM solves the vast majority of SMS and non-SMS allocation problems including those associated with databases and other critical applications. The Storage Administrator no longer has to worry about SMS deciding that high-performance databases can be allocated on the same volume just because there is enough space to put them there. MAINVIEW SRM helps the system to manage storage more effectively and efficiently. Furthermore, the automation functions of MAINVIEW SRM ensure that both routine and exception operations are not left to chance.
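The co-location concern mentioned above, high-performance databases landing on the same volume simply because space was available there, can be expressed as a simple separation check. The sketch below is illustrative only and does not represent how MAINVIEW SRM or SMS implements allocation control; the function name, data set names, and volume serials are hypothetical.

```python
from collections import defaultdict


def colocated(placements, critical):
    """Return volumes that hold more than one critical data set.

    `placements` maps data set name -> volume serial; `critical` is the
    set of data sets that should not share a volume.
    """
    by_volume = defaultdict(list)
    for name, volser in placements.items():
        if name in critical:
            by_volume[volser].append(name)
    return {
        volser: sorted(names)
        for volser, names in by_volume.items()
        if len(names) > 1
    }


# Hypothetical placements, for illustration only.
placements = {
    "DB2.ORDERS.DATA": "PRD001",
    "DB2.CUSTOMER.DATA": "PRD001",
    "DB2.ORDERS.INDEX": "PRD002",
    "USER.REPORTS": "PRD001",
}
critical = {"DB2.ORDERS.DATA", "DB2.CUSTOMER.DATA", "DB2.ORDERS.INDEX"}
```

Running the check over these sample placements flags `PRD001`, where two critical database data sets share a volume, while the non-critical `USER.REPORTS` data set is ignored.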
In today’s fast-paced, unforgiving, storage environment, learning on the job is like a “trial by fire.” MAINVIEW SRM gives the Storage Administrator the information to solve problems on the first try by immediately getting to the root cause of a problem.
About BMC Software
BMC Software, Inc. [NYSE:BMC], is a leading provider of enterprise management solutions that empower companies to manage IT from a business perspective. Delivering Business Service Management, BMC Software solutions span enterprise systems, applications, databases and service management. Founded in 1980, BMC Software has offices worldwide and fiscal 2003 revenues of more than $1.3 billion. For more information about BMC Software, visit www.bmc.com.