The Case for Host Processor Compression
Introduction
For most applications, there are significant advantages to using host processor data compression over DASD subsystem data compression. Processor compression results in I/O resource savings, potentially delaying expensive hardware upgrades. Operational benefits of processor compression include reduced elapsed times and improved throughput and response times. Processor compression can optimize savings based on specific data attributes, unlike subsystem compression, which offers no optimization. Processor compression eases the impact of the 4 GB VSAM limitation, and will operate on any DASD subsystem. Due to its lack of flexibility and complex storage structure, DASD subsystem compression may introduce undesirable operational impacts and complicate capacity planning efforts.
Some Definitions
Host compression refers to compression that is performed using central processing unit (CPU) resources, or cycles.
Subsystem compression refers to compression that is performed by microcode built into a DASD subsystem. Compression subsystem refers to a DASD subsystem that uses internal compression.
I/O resources are the hardware and microcode resources built into the computer system, existing between the CPU and the DASD subsystem.
Compression techniques are specific well-defined algorithms used to compress data, and implemented in software or microcode.
Resource Utilization Attributes of Subsystem and Processor Compression
Data compression is not free. Resources are required in the host CPU or the DASD subsystem. Host compression uses CPU cycles to compress and expand data, either by software algorithm or microcode assistance. Better job throughput or transaction response time, and better DASD subsystem performance, normally offset the CPU resource used by host compression.
Subsystem compression uses the hardware of the subsystem controller to compress and expand data. Data from the host is compressed before storing the data into the subsystem cache memory. Data read by the host is expanded before returning it to the host. The subsystem will also compact the data, removing interrecord gaps and unused space from the functional track before storing the data.
Comparison of Compression Techniques
There are many algorithms for compressing data, each with its own level of compression achieved and resources required for the compression process. Most subsystem compression schemes use the Ziv-Lempel compression technique. This technique looks for repeating patterns in the data and encodes the patterns into shorter strings, reducing the length of the original data. The expansion process reverses the process to yield the original data. The compression ratio is the ratio of the original data size to the compressed data size. Subsystem compression would typically yield a 3:1 to 3.6:1 compression ratio, or roughly a 67% to 72% reduction in the amount of data to store.
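The relationship between a compression ratio and the resulting space reduction can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses Python's standard zlib module (DEFLATE, a Ziv-Lempel variant combined with Huffman coding) as a stand-in, not the algorithm of any particular subsystem, and the sample record is hypothetical.

```python
import zlib


def reduction_percent(ratio: float) -> float:
    """Percent reduction in stored bytes for a compression ratio of ratio:1."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (1.0 - 1.0 / ratio) * 100.0


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size."""
    return len(original) / len(compressed)


# Hypothetical sample with heavy repetition, as log-style data often has.
sample = b"IEF403I PAYROLL1 - STARTED - TIME=12.00.00\n" * 2000

packed = zlib.compress(sample)      # repeating patterns become short references
restored = zlib.decompress(packed)  # expansion reverses the process exactly

if __name__ == "__main__":
    for ratio in (3.0, 3.6):
        print(f"{ratio}:1 -> {reduction_percent(ratio):.1f}% reduction")
    print(f"zlib on sample: {compression_ratio(sample, packed):.1f}:1")
    print("round trip intact:", restored == sample)
```

Real data rarely repeats as regularly as this sample, so the ratio printed for it says nothing about what production data sets would achieve.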
Host compression using Data Accelerator Compression (DAC) offers seven different compression techniques including run-length encoding, Ziv-Lempel, static and custom Huffman, and arithmetic adaptive. DAC allows a technique to be chosen that will yield the best compression ratio for a particular data type. The compression ratio varies according to the DAC technique used and in most cases meets or exceeds the compression ratio for subsystem compression.
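Run-length encoding is the simplest of the techniques named above, and a minimal sketch shows why the choice of technique depends on the data. The encoding format here (count byte followed by value byte) is an assumption made for illustration; it is not DAC's actual format.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data as (count, value) byte pairs; each run is capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Expand (count, value) pairs back into the original bytes."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        count, value = encoded[k], encoded[k + 1]
        out += bytes([value]) * count
    return bytes(out)


if __name__ == "__main__":
    padded = b"SMITH" + b" " * 75   # fixed-length field padded with blanks
    mixed = b"ABCD"                 # no repeated bytes at all
    for record in (padded, mixed):
        packed = rle_encode(record)
        assert rle_decode(packed) == record
        print(len(record), "bytes ->", len(packed), "bytes")
```

A blank-padded record shrinks considerably, while data with no runs doubles in size under this naive scheme, which is one reason a selectable technique is useful.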
The following chart is a relative comparison of compression ratios for some of the various DAC techniques. The ratios are based on a 114-MB MVS SYSLOG data set.
Legend
RC Run-length encoding
EC Extended Character; modified Ziv-Lempel
SH Static Huffman
HW Hardware; bit-level compatible IBM® hardware compression
CD Custom Dictionary; hardware plus custom dictionary
Subsystem Compression Considerations
On the surface, subsystem compression appears to be very simple and straightforward. You install and configure the subsystem, move your data to it, and voilà, your problems are solved. In fact, there are some very important factors to consider before making the decision to allow a subsystem to compress some or all of your data. Some key considerations are:
Subsystem compression is not selective; it compresses everything. This may seem inconsequential. After all, do you really care if the subsystem compresses data that does not need to be compressed? From the standpoint of compression, you probably don't. However, there are real costs associated with this lack of selectivity, most notably performance degradation.
The subsystem itself introduces potential performance degradation. Because it must compress and expand all data, a compression subsystem is inherently slower than a subsystem that does no compression. The average service time for an I/O on a compression subsystem is 8 ms, compared to 3 ms on a noncompression subsystem. Since all data is compressed and must be expanded, moving highly active datasets such as page datasets and catalogs to a compression subsystem increases their I/O service time from 3 ms to 8 ms, an increase of roughly 167%.
Subsystem compression does nothing to alleviate demands on I/O resources. In fact, installation of a large compression subsystem can force more activity onto fewer channels while doing nothing to reduce the quantity of the data, possibly aggravating existing I/O resource constraints and potentially introducing new ones.
The virtual nature of the compression subsystem leads to a duality of views and can complicate capacity planning and tuning efforts. Functionally, the subsystem can be configured to overcommit physical capacity. Determining the actual usage of the subsystem requires more than the use of traditional capacity measurement tools. Standard DASD reports can be meaningless for charge-back and capacity planning needs, and the capacity load measurements in the compression subsystem are arcane and too general to be of much value. Answering the simple question "How much of our DASD capacity are we using?" can be difficult or impossible. Also, while non-compressed DASD can be 100% used with no performance impact, the exact point at which a compression subsystem starts to thrash may be well below that number.
The impact of adding more data to a compression subsystem cannot be determined ahead of time, as there are no customer tools for estimating the compressed size of the new data or its effect on the capacity load. The only way to determine whether more data will fit without causing performance problems is to place it on the compression subsystem and see what happens.
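The planning difficulty described above comes down to one unknown: the compression ratio the data will actually achieve. The following sketch, with hypothetical capacities and a hypothetical class name, shows how the physical load of an overcommitted subsystem swings with that ratio. It is a simplified model, not a description of any vendor's capacity measurement.

```python
from dataclasses import dataclass


@dataclass
class CompressionSubsystem:
    physical_gb: float    # real back-end capacity
    functional_gb: float  # capacity presented to the host as volumes

    def overcommit_factor(self) -> float:
        """How many times the functional capacity exceeds physical capacity."""
        return self.functional_gb / self.physical_gb

    def physical_used_gb(self, host_data_gb: float, ratio: float) -> float:
        """Physical space consumed by host_data_gb at an assumed ratio:1."""
        return host_data_gb / ratio

    def physical_load_percent(self, host_data_gb: float, ratio: float) -> float:
        """Share of physical capacity in use, as a percentage."""
        return 100.0 * self.physical_used_gb(host_data_gb, ratio) / self.physical_gb


if __name__ == "__main__":
    # Hypothetical configuration: 100 GB physical presented as 300 GB functional.
    box = CompressionSubsystem(physical_gb=100.0, functional_gb=300.0)
    for assumed_ratio in (3.6, 3.0, 2.0):
        load = box.physical_load_percent(240.0, assumed_ratio)
        print(f"240 GB at {assumed_ratio}:1 -> {load:.0f}% of physical capacity")
```

In this model the same 240 GB of host data either fits comfortably or exceeds physical capacity depending on a ratio that is only known once the data has been written.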
Processor Compression Considerations
The only potential consideration for using processor compression is processor capacity, or cycles, required to compress and expand data without having a negative impact on the rest of the workload. It turns out that in most cases the processor overhead is very small and the resulting reduction in I/O resource consumption and DASD space more than offsets the cost.
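The I/O side of that trade-off can be approximated with simple arithmetic. The sketch below assumes one I/O per block, a 3:1 ratio, and a half-track block size of 27,998 bytes (the usual value for 3390 geometry); all three are assumptions chosen for illustration rather than measurements.

```python
import math

HALF_TRACK_BYTES = 27_998  # assumption: half-track block size on 3390 geometry


def io_count(dataset_bytes: int, block_bytes: int) -> int:
    """Number of block transfers needed to read a dataset once, one I/O per block."""
    return -(-dataset_bytes // block_bytes)  # integer ceiling division


def ios_after_compression(dataset_bytes: int, ratio: float,
                          block_bytes: int = HALF_TRACK_BYTES) -> int:
    """I/O count once the dataset is compressed ratio:1 and reblocked."""
    compressed_bytes = math.ceil(dataset_bytes / ratio)
    return io_count(compressed_bytes, block_bytes)


if __name__ == "__main__":
    size = 1_000_000_000  # hypothetical dataset of one billion bytes
    before = io_count(size, 4096)  # poorly blocked original
    after = ios_after_compression(size, 3.0)
    print(f"before: {before} I/Os, after: {after} I/Os")
```

Most of the reduction in this example comes from reblocking, with compression contributing the rest; real access methods chain several blocks per I/O, so actual counts will differ.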
Subsystem compression benefits
Subsystem compression will compress all data that is stored in the subsystem, with no consideration of which data sets are good candidates for compression and which are not. If the subsystem is sized correctly prior to installation, then the benefit is that more data can be stored functionally than the available physical subsystem storage would otherwise hold. The subsystem will also compact the track image by removing unused track space and not writing interrecord gaps between blocks of data.
Processor compression benefits
Host compression processing offers a number of benefits:
- Compression is selective. Only those data sets specified will be compressed. High performance datasets such as MVS catalogs and page datasets do not have to incur the overhead of compression and only datasets that will benefit from compression will be compressed.
- Host compression using DAC offers a variety of compression techniques so that compression performance can be optimized to the data, including the ability to do custom compression based on the actual data. A trial utility is provided with DAC that will sample data and compare the compression ratio for each technique so that the best technique can be specified. If no technique is picked, DAC will default to a dynamic adaptive algorithm that will sample the data and choose from Ziv-Lempel or static Huffman based on the best compression ratio. If no ratio can be calculated, the arithmetic adaptive technique is used.
- Host compression reduces the space utilization of the DASD subsystem. Compressing the data on the host requires less physical storage space in the DASD subsystem.
- Host compression can reduce I/O resource constraints. By compressing data on the host, more data can be written to or read from the DASD subsystem for each I/O request. The dataset will be smaller on the subsystem, requiring fewer I/Os to read or write it. Less I/O means better DASD subsystem performance. DAC writes the compressed data in half-track blocks. If the original dataset has a poor block size, then DAC can reblock the dataset transparently to the application, improving performance and saving subsystem space. With the DASD subsystem performance gained by host compression, job throughput, transaction response time and job elapsed time are improved.
- Host compression can also ease operating system restrictions such as the 4 GB VSAM limitation. With host compression, the dataset will grow more slowly, allowing time for conversion to extended addressability or perhaps preventing the dataset from reaching the 4 GB limit.
- Host compression works with any current DASD subsystem including traditional DASD and the newer RAID subsystems. An investment in host compression is a viable long-term solution.
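The idea behind a trial utility, sampling data and comparing the ratio each technique achieves, can be sketched as follows. The three codecs are general-purpose compressors from the Python standard library used as stand-ins; they are not the DAC techniques, and the function names are hypothetical.

```python
import bz2
import lzma
import zlib
from typing import Callable, Dict

# Stand-in codecs from the standard library; they are NOT the DAC techniques.
CODECS: Dict[str, Callable[[bytes], bytes]] = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def trial(sample: bytes) -> Dict[str, float]:
    """Return the original/compressed size ratio achieved by each codec."""
    if not sample:
        raise ValueError("sample must not be empty")
    return {name: len(sample) / len(fn(sample)) for name, fn in CODECS.items()}


def best_technique(sample: bytes) -> str:
    """Name of the codec with the highest ratio on this sample."""
    ratios = trial(sample)
    return max(ratios, key=ratios.get)


if __name__ == "__main__":
    # Hypothetical sample: fixed-length records with blank padding.
    sample = b"".join(b"%08d CUSTOMER NAME".ljust(80) % i for i in range(5000))
    for name, ratio in sorted(trial(sample).items(), key=lambda kv: -kv[1]):
        print(f"{name:5s} {ratio:6.1f}:1")
    print("best:", best_technique(sample))
```

Running the trial against a representative sample of each data set, rather than assuming one technique suits all data, is what makes selective host compression tunable.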
Case Study
A standalone benchmark was run comparing batch performance on a Hitachi RAID subsystem versus an IBM RAMAC virtual array. The benchmark emulated batch processing by loading a VSAM data set with approximately 1 million records and then reading, updating and inserting records into the dataset. Each benchmark run was based on a number of records, N, to be updated. The process was (1) read N records using random keys, (2) read and update N records using random keys, (3) insert approximately N/10 records, and (4) read the entire dataset sequentially.
The VSAM dataset was compressed by DAC when using the Hitachi subsystem. No host compression was used for the RAMAC cases.
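The four-step workload is easy to restate as code. The sketch below emulates it against an in-memory dictionary, purely to make the phases concrete; it is not the benchmark program that was actually run, it performs no DASD I/O, and the function and phase names are hypothetical.

```python
import random
import time


def run_benchmark(total_records: int, n: int, seed: int = 0) -> dict:
    """Emulate the four benchmark phases against an in-memory keyed store."""
    rng = random.Random(seed)
    key_space = total_records * 10
    # Load phase: keys are spaced 10 apart so later inserts can land between them.
    dataset = {key: b"payload" for key in range(0, key_space, 10)}
    loaded_keys = list(dataset)
    timings = {}

    def timed(phase, work):
        start = time.perf_counter()
        work()
        timings[phase] = time.perf_counter() - start

    def random_reads():        # (1) read N records using random keys
        for key in rng.choices(loaded_keys, k=n):
            _ = dataset[key]

    def random_updates():      # (2) read and update N records using random keys
        for key in rng.choices(loaded_keys, k=n):
            dataset[key] = dataset[key] + b"*"

    def inserts():             # (3) insert approximately N/10 new records
        added = 0
        while added < n // 10:
            key = rng.randrange(key_space)
            if key not in dataset:
                dataset[key] = b"payload"
                added += 1

    def sequential_read():     # (4) read the entire dataset in key order
        for key in sorted(dataset):
            _ = dataset[key]

    timed("random_read", random_reads)
    timed("random_update", random_updates)
    timed("insert", inserts)
    timed("sequential_read", sequential_read)
    return {"records": len(dataset), "timings": timings}


if __name__ == "__main__":
    result = run_benchmark(total_records=100_000, n=10_000)
    print(result["records"], "records after the run")
    for phase, seconds in result["timings"].items():
        print(f"{phase:16s} {seconds:.4f} s")
```

Varying n while holding the record count fixed mirrors the way each benchmark run was parameterized by the number of records to update.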
Figure 2 recaps the result of the benchmark. A noticeable reduction in elapsed time for loading and processing the dataset can be observed, and CPU consumption was very small during the processing phase. In fact, as the size of the workload increased, the clock time improvement increased and the CPU consumption decreased using DAC Compression.
Conclusions
Surprisingly, our evaluation indicates that there are no real functional or performance benefits to using subsystem compression. While it does compress data, the compression function is not a benefit to the customer in terms of performance, throughput, tuning, capacity planning or any other meaningful data-related measure. The sole benefit to the customer is the reduced cost of the subsystem, and since the subsystem itself may degrade meaningful data-related functions, one may speculate that the value of the cost savings may be overestimated.
On the other hand, processor compression actually improves every aspect of data-related functions while consuming a measurable resource. The impacts can be predicted and measured, and traditional methods of tuning and capacity planning apply. The customer can decide what should or should not be compressed, and even what technique should be used, thereby optimizing performance and leveraging the full capability of the investment in hardware resources.
BMC Software, the BMC Software logos and all other product or service names are registered trademarks or trademarks of BMC Software, Inc. IBM is a registered trademark of International Business Machines Corp. All other trademarks or registered trademarks belong to their respective companies. © 2000, BMC Software, Inc. All rights reserved.