BMC Mainframe: Troubleshooting z/OS Operational Problems
The course is developed and delivered by © RSM Technology.
This course explains what can go wrong in an IBM Z Systems environment and what you can do about it as an operator or systems programmer. It looks at failure situations from several points of view: the physical computer room, the hardware and the software environment.
The software environment is further examined by looking at the Recovery Termination Manager (RTM) - the 'cleaning-up' function of z/OS - and its ABEND concept.
The course also discusses the reports that a z/OS system produces when failures occur (messages, dumps, traces, etc.). It covers the most common reasons for system ABENDs and how to analyze the information the system produces when they occur.
Major release:
BMC Mainframe Infrastructure Platform Training
Recommended Prerequisites:
Good for:
Systems Programmers, Users
Course Delivery:
Instructor-Led Training (ILT) | 24 hours
Course Modules
-
What is an Operational problem?
- The z/OS mainframe - a large system
- What can go wrong? The operational view and the application view
- The 'hole in the ground'
- Loss of electric power: preparing for power failures
- Hardware problems - total loss of critical components
- Critical system software failure
- Partial loss of hardware
- Partial loss of system software
- JES problems
- SMF problems
- Application systems failures
- Performance degradation
- Actions: Summary
- Review questions
-
z/OS Software Environment
- The z/OS environment - a lot of programs
- Software categories
- The mission of an Operating System
- Workload in MVS
- Asking for MVS services
- Asynchronous MVS activities
- Asynchronous (unwelcome) z/OS activities
- Summary
- Review questions
-
Recovery Termination Manager (RTM)
- Normal Program Termination
- Abnormal program termination
- Why abnormal termination?
- Logical application error
- Program incomplete
- Application detected software error
- System detected software error
- Hardware detected software error
- Program Checks in the Supervisor
- Hardware problems
- RTM actions
- System breakdown
- Software problem types
- Review questions
-
z/OS Error Reporting & Dumps
- System error reporting
- z/OS dumps
- Stand-Alone Dump (SADUMP)
- SVC dumps
- User ABEND dumps
- Generating a user ABEND dump
- System generated ABEND dump
- Snap dumps
- Symptom dumps
- Review questions
-
ABEND Analysis
- What is ABEND?
- The z/OS ABEND service
- Tasks in an Address Space
- How RTM is invoked
- Why not normal end?
- Application detected software errors
- System detected software errors
- All the system ABEND codes
- Where do you see the ABEND codes?
- The NOTIFY message
- The System SYSLOG
- The job log
- The symptom dump in the SYSLOG
- The symptom dump in the job log
- Explanations of ABEND and reason codes
- Analysis approach
- Examples of ABEND code explanation
- System messages - a good information source
- System message prefix
- Message level
- Message identifier and z/OS components
- Common message identifier groups
- Examples of system messages
- Explanation of system messages
- Common system ABEND codes
- System ABEND code numbers
- Common SVCs and their macros
- The x22 codes - caused by outside events
- The x13 codes - OPEN problems
- Example of S013-18
- S806 - Program not found
- Example of S806-04
- S804, S80A, S878, S822 and SDC2 - virtual storage problems
- The Virtual Address Space
- Virtual Storage requests
- Limitations on Virtual Storage
- ABEND and reason codes
- The REGION limit
- The effects of different REGION values
- Example of ABEND S822
- The MEMLIMIT parameter
- Example of ABEND SDC2
- The S0Cx codes
- Running RTM1
- PC FLIH and ABENDs
- The meaning of Program Checks
- Program Check codes
- Common ABENDs from Program Checks
- Storage Protect Keys
- Virtual address protection
- Reasons for translation exceptions
- Other S0Cx ABENDs
- The S0E0 and 0Dx codes
- The Sx37 and SB14 codes
- How disk datasets are allocated
- Physical Sequential (PS) datasets
- Example of unavailable primary allocation
- Example of SD37-04
- Example of ABEND SB37-04
- Partitioned Data Sets (PDS)
- Problems when allocating a PDS
- Example of ABEND SE37-04
- Summary of common x37 ABENDs
- Example of ABEND SB14
- Partitioned Data Sets Extended (PDSE)
- Summary of common system ABEND codes
- Other ABEND codes
-
The Hardware - CPU & Storage
- A mainframe installation - a lot of hardware
- The hardware components
- Real storage
- Virtual Storage
- The CPU
- Controlling the modes - PSW
- PSW control bits
- Where do you find the PSW?
- Why look at the PSW?
- Disabled Wait
- Enabled Wait
- Enabled Loop
- Disabled Loop
- Review questions
-
The Hardware - Input/Output Processing
- I/O devices
- Control Units
- I/O processing in principle
- Defining the I/O Configuration
- The Hardware System Area (HSA)
- The z/OS configuration
- The I/O users in z/OS
- Review questions
-
Hardware Errors & Recovery
- What is System Recovery?
- Hardware error types
- Machine Check processing and MCIC
- Soft errors
- Hard errors
- Terminating errors
- Hardware error areas
- Soft CPU errors
- Soft CPU error reporting
- Hard CPU errors
- The effect of hard CPU errors
- Terminating CPU errors
- Processing terminating CPU errors - one CPU
- Processing terminating CPU errors - multiple CPUs
- Service Processor damage
- Storage errors
- Soft storage errors
- z/OS action after soft errors
- Hard storage errors
- Effect of hard storage errors
- I/O and Channel Subsystem errors
- Channel Subsystem error reporting
- Channel Path recovery
- Terminal error condition
- Permanent error condition
- Initialized condition
- I/O related errors
- Device/Control Unit errors (I/O errors)
- No path available
- Device status errors
- Subchannel status errors
- Hot I/O conditions
- Hot I/O recovery
- Hot I/O messages (non-DASD)
- Hot I/O messages (DASD)
- Response to Hot I/O message
- Using IECIOSxx for Hot I/O processing
- HIO options in IECIOSxx
- Example of IECIOSxx parameters
- Missing Interrupts
- Missing Interrupt intervals
- Special considerations for MIH intervals
- Missing Interrupt messages
- I/O Timing facility
- I/O Timing messages
- Review questions
-
LOGREC and EREP
- The Error Recording Data Set (ERDS) of MVS
- LOGREC in z/OS
- LOGREC contents
- LOGREC Event Record types
- Re-initializing LOGREC with IFCDIP00
- Re-allocating LOGREC with IFCDIP00
- The EREP program
- EREP reports
- Controlling EREP
-
Generalized Trace Facility (GTF)
- Traces in MVS
- What is GTF?
- How to obtain a GTF trace
- The GTF JCL procedure
- Starting GTF
- Traceable events
- GTF parameters - I/O events
- Examples of I/O parameters
- CCW tracing example
- CCW tracing output
- Dispatcher events
- External interrupts
- Program interrupts
- GTF-tracing of VTAM activity
- SVC interrupts
- Recovery routines and SLIP events
- Parameter summary