Redis®* Cache on Production: An Overview and Best Practices

In the modern landscape of complex applications, cloud-native technologies empower organizations to build and run scalable applications in public, private, and hybrid clouds.

An in-memory cache has become an essential component for developing loosely coupled systems that are resilient, manageable, and observable with microservices, containers, and immutable infra-services.

Unlike traditional databases, in-memory data stores don’t require a trip to disk, reducing engine latency to microseconds, so they can support an order of magnitude more operations and faster response times. The result is blazing-fast performance with average read and write operations taking less than a millisecond and support for millions of operations per second.

Building on earlier experience using caching solutions like Infinispan and Hazelcast, we evaluated various cloud-based and on-premises cache solutions with the following requirements:

  • Ability to scale out seamlessly, from a few thousand events per second to multimillion events
  • Support for various data types and languages
  • Performance metrics for monitoring
  • Cache/entry-level time to live (TTL) support

Based on our findings, our BMC Helix SaaS solutions leverage Redis and the Redisson client for faster, more accurate, and more efficient ways of delivering innovations for the modern enterprise. Redis, which is short for REmote DIctionary Server, is an in-memory data structure store used as a cache that enables low-latency, high-throughput data access.

If you’re interested in deploying Redis at your organization, keep reading for some tips and best practices based on what we’ve learned from our deployment.

In-Memory Caching Service

First, you will need an in-memory caching service that supports Redis. Both managed cloud services (such as Amazon ElastiCache, Amazon MemoryDB for Redis, Azure Redis Cache, and Google Cloud Memorystore) and on-premises deployments are options.

Deployment Types

You should choose your deployment type based on your application use cases, scale, and best practices, while also considering factors such as number of caches, cache size, Pub/Sub workloads, and throughput.

Non-Cluster Mode

Figure 1. Non-cluster deployment with a single shard containing one primary and two replica nodes.

Cluster Mode

Figure 2. Cluster deployment with three shards, each containing one primary and two replica nodes.

Sharding

A shard is a grouping of one to six nodes within a cluster that supports replication. Within a shard, one node functions as the read-write primary node, and all the other nodes function as read-only replicas. Below are a few key points about individual shards:

  • A shard can have up to five replicas (one primary plus up to five replica nodes)
  • Nodes in a shard should be deployed across multiple availability zones or data centers for fault tolerance
  • If the primary node fails, one of the replicas is promoted to primary

A production deployment might use three shards with three nodes per shard (one primary and two replicas), each residing in a different availability zone or data center. The cluster node type (CPU/memory) and scale-out/scale-in decisions are based on the cache types, cache sizes, and the number of operations per second.

Every shard in a Redis cluster is responsible for a subset of the 16,384 hash slots. So, for example, you may have a cluster with three replication groups (shards), as follows (the sketch after this list shows how a key maps to a slot):

  • Shard 1 contains hash slots from 0 to 5500
  • Shard 2 contains hash slots from 5501 to 11000
  • Shard 3 contains hash slots from 11001 to 16383
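To make the slot ranges above concrete: Redis Cluster assigns each key to a slot by computing CRC16 of the key modulo 16384 (honoring hash tags in curly braces). The minimal Java sketch below illustrates the mapping; it is a standalone illustration rather than code from our deployment, and the sample key name is made up.

import java.nio.charset.StandardCharsets;

public final class RedisSlot {

    private static final int SLOTS = 16384;

    // CRC16/XMODEM (polynomial 0x1021, initial value 0), the variant used by Redis Cluster.
    private static int crc16(byte[] bytes) {
        int crc = 0;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    // Returns the hash slot (0-16383) for a key; only the text inside {...}, if present
    // and non-empty, is hashed, so related keys can be forced into the same slot.
    public static int slotFor(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOTS;
    }

    public static void main(String[] args) {
        // Made-up key; the printed slot falls into one of the three shard ranges listed above.
        System.out.println(slotFor("tenant-42:metrics"));
    }
}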

Redis Client

We zeroed in on Redisson after evaluating the available APIs based on the use cases and data structure requirements. It provides distributed Java data structures on top of Redis for objects, collections, locks, and message brokers and is compatible with Amazon ElastiCache, Amazon MemoryDB for Redis, Azure Redis Cache, and Google Cloud Memorystore.
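For reference, here is a minimal sketch of creating a Redisson client against a cluster deployment. The endpoint address is a placeholder, not a real host, and the configuration shown is the bare minimum rather than our production settings.

import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RedissonBootstrap {

    public static RedissonClient create() {
        Config config = new Config();
        // Cluster mode: Redisson discovers the remaining primaries and replicas
        // from this seed address and keeps the topology up to date.
        config.useClusterServers()
              .addNodeAddress("redis://redis-cluster.example.internal:6379"); // placeholder endpoint
        return Redisson.create(config);
    }
}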

Redis Client Key Usages

A streaming application that processes millions of metric, event, and log messages per second has various use cases that require low-latency cache operations, which informed our choice of cache type.

RMap is a Redis-based distributed map object implementing the Java ConcurrentMap interface. It's appropriate for the following (a minimal usage sketch follows the list):

  • Use cases where short-lived caches are required
  • Eviction occurs at cache level and not at key/entry level
  • Clarity exists on the probable cache size and max insert/retrieve operations
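A minimal sketch of that pattern, assuming a recent Redisson 3.x: a short-lived cache whose entire map expires via a cache-level TTL. The cache name, keys, and TTL are illustrative.

import java.util.concurrent.TimeUnit;

import org.redisson.api.RMap;
import org.redisson.api.RedissonClient;

public class ShortLivedCacheExample {

    public static void cacheBatch(RedissonClient redisson) {
        // One logical cache backed by a single Redis hash.
        RMap<String, String> batchCache = redisson.getMap("ingest:current-batch"); // illustrative name
        batchCache.put("device-1", "state-a");
        batchCache.put("device-2", "state-b");
        // Cache-level TTL: the whole map expires together; there is no per-key eviction here.
        batchCache.expire(30, TimeUnit.MINUTES);
    }
}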

RLocalCachedMap is a near-cache implementation that speeds up read operations and avoids network round trips. It caches map entries on the Redisson side and executes read operations up to 45 times faster than common implementations. Redis itself doesn't provide map-entry eviction, so expired entries are cleaned up incrementally by org.redisson.eviction.EvictionScheduler. RLocalCachedMap is appropriate for the following (see the sketch after this list):

  • Use cases where the number of cache keys is certain and won’t grow beyond a certain limit
  • The number of cache hits is high
  • The workflow can afford infrequent cache hit misses
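A minimal sketch of a near cache, assuming a recent Redisson 3.x; the sizing, eviction policy, and TTL values are illustrative rather than our production configuration.

import java.util.concurrent.TimeUnit;

import org.redisson.api.LocalCachedMapOptions;
import org.redisson.api.RLocalCachedMap;
import org.redisson.api.RedissonClient;

public class NearCacheExample {

    public static RLocalCachedMap<String, String> create(RedissonClient redisson) {
        LocalCachedMapOptions<String, String> options = LocalCachedMapOptions.<String, String>defaults()
                .cacheSize(10_000)                                        // bound the local copy
                .evictionPolicy(LocalCachedMapOptions.EvictionPolicy.LFU) // evict least frequently used locally
                .timeToLive(10, TimeUnit.MINUTES);                        // limit staleness of local entries
        // Reads hit the local copy when possible, avoiding a network round trip.
        return redisson.getLocalCachedMap("config:near-cache", options);  // illustrative name
    }
}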

RMapCache is a cache object that supports eviction at the key level. It's appropriate for use cases that require that functionality and for situations where ephemeral cache keys must be cleaned up periodically (a minimal usage sketch follows).
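A minimal sketch of a per-entry TTL with RMapCache; the cache name, key, and TTL are illustrative.

import java.util.concurrent.TimeUnit;

import org.redisson.api.RMapCache;
import org.redisson.api.RedissonClient;

public class KeyLevelTtlExample {

    public static void cacheSession(RedissonClient redisson, String sessionId, String payload) {
        RMapCache<String, String> sessions = redisson.getMapCache("sessions"); // illustrative name
        // The TTL applies to this entry only; other keys in the same map keep their own TTLs.
        sessions.put(sessionId, payload, 15, TimeUnit.MINUTES);
    }
}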

Redis-based Multimap for Java allows you to bind multiple values per key.

Redis-based RLock is a distributed reentrant lock object for Java.
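Minimal sketches of both; the key names are illustrative, and the lock wait and lease times are examples rather than recommendations.

import java.util.concurrent.TimeUnit;

import org.redisson.api.RLock;
import org.redisson.api.RSetMultimap;
import org.redisson.api.RedissonClient;

public class MultimapAndLockExample {

    public static void run(RedissonClient redisson) throws InterruptedException {
        // Multimap: bind several values to the same key.
        RSetMultimap<String, String> tagsByHost = redisson.getSetMultimap("host-tags"); // illustrative name
        tagsByHost.put("host-1", "linux");
        tagsByHost.put("host-1", "prod");

        // Distributed reentrant lock: wait up to 5 seconds to acquire, auto-release after 30 seconds.
        RLock lock = redisson.getLock("host-1:maintenance-lock"); // illustrative name
        if (lock.tryLock(5, 30, TimeUnit.SECONDS)) {
            try {
                // critical section guarded across all service instances
            } finally {
                lock.unlock();
            }
        }
    }
}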

Monitoring Key Performance Indicators (KPIs)

The following KPIs should be monitored to ensure that the cluster is stable:

  • EngineCPUUtilization: CPU utilization of the Redis engine thread
  • BytesUsedForCache: Total number of bytes of memory used by the cache
  • DatabaseMemoryUsagePercentage: Percentage of the available cluster memory in use
  • NetworkBytesIn: Number of bytes read from the network; monitor at host, shard, and overall cluster level
  • NetworkBytesOut: Number of bytes sent out; monitor at host, shard, and cluster level
  • CurrConnections: Number of active client connections
  • NewConnections: Total connections accepted during a given period

Redis Is Single-Threaded

Redis uses a mostly single-threaded design, which means that a single process serves all client requests using a technique called multiplexing. Multiplexing allows for a form of implicit pipelining, which, in the Redis sense, means sending commands to the server without waiting for the responses to previously sent commands.
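On the client side, Redisson exposes this kind of pipelining explicitly through batches, which queue commands and send them together. A minimal sketch follows, with made-up cache and key names.

import org.redisson.api.BatchOptions;
import org.redisson.api.RBatch;
import org.redisson.api.RedissonClient;

public class PipelineExample {

    public static void writeBatch(RedissonClient redisson) {
        // Commands queued on the batch are sent together and executed without waiting
        // for individual responses; all results are returned when execute() completes.
        RBatch batch = redisson.createBatch(BatchOptions.defaults());
        batch.getMap("metrics:latest").fastPutAsync("cpu", "42");      // made-up keys and values
        batch.getMap("metrics:latest").fastPutAsync("memory", "73");
        batch.getBucket("metrics:last-updated").setAsync("2022-10-06T07:25:38Z");
        batch.execute();
    }
}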

Production Issues

As we expanded from a few caches to many, we scaled vertically and horizontally based on the key metrics above, cost, and vendor recommendations. One critical issue we faced was a high engine CPU utilization warning even though the application's read-write flow was unchanged, and it made the whole cluster unresponsive. Scaling out and scaling up didn't help, and the issue recurred.

Engine CPU Utilization

Figure 3. Engine CPU utilization of one of the shards that breached a critical threshold.

Troubleshooting Steps

Key Findings

  • Publish and subscribe (pub/sub) operations were high on the problematic shard
  • One of the hash slots had a large number of keys
  • RMapCache seems to be the culprit

Issues with RMapCache

RMapCache uses a custom scheduler to handle key-level TTLs, which triggers a large number of cache entry cleanups, generating heavy pub/sub traffic and keeping the cluster bus busy.

After a client publishes a message on a single node, that node propagates the same message to the other nodes in the cluster through the cluster bus. Currently, the pub/sub feature does not scale well with large clusters: enhanced input/output (IO) cannot flush the large buffer on the cluster bus connection efficiently under heavy pub/sub traffic. Redis 7 introduces a new feature called sharded pub/sub to solve this problem.

Lessons Learned

  1. Choose cache types based on usage patterns:
    • Cache without key-level TTL
    • Cache with key-level TTL
    • Local or near cache

    For a cache with key-level TTL, partition the cache into multiple logical cache units as much as possible so entries are distributed among the shards; the sketch after this list shows one way to do this. The number of caches can grow to a few thousand without issue. Short-lived caches with cache-level TTL are another option.

  2. While leveraging Redisson or other client implementations on top of Redis, be careful with the configuration and its impact on the cluster.
    Ensure that the value part is not a collection (if a collection is unavoidable, limit its size). Updating an entry whose value is a collection has a large impact on replication.
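A minimal sketch of the partitioning approach from lesson 1, using a hypothetical helper class; the class name, partition count, and hashing scheme are illustrative, not how our services are implemented.

import java.util.concurrent.TimeUnit;

import org.redisson.api.RMapCache;
import org.redisson.api.RedissonClient;

public class PartitionedTtlCache {

    private final RedissonClient redisson;
    private final String baseName;
    private final int partitions;

    public PartitionedTtlCache(RedissonClient redisson, String baseName, int partitions) {
        this.redisson = redisson;
        this.baseName = baseName;
        this.partitions = partitions;
    }

    // Each partition is a separately named RMapCache; distinct names hash to different
    // slots, so entries (and their eviction-related pub/sub traffic) spread across shards.
    private RMapCache<String, String> partitionFor(String key) {
        int p = Math.floorMod(key.hashCode(), partitions);
        return redisson.getMapCache(baseName + ":" + p);
    }

    public void put(String key, String value, long ttl, TimeUnit unit) {
        partitionFor(key).put(key, value, ttl, unit);
    }

    public String get(String key) {
        return partitionFor(key).get(key);
    }
}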

Conclusion

Looking to provide a real-time enterprise application experience at scale? Based on our usage and experience, we recommend that you check out Redis along with the Redisson Client.

Experience it for yourself with a free, self-guided trial of BMC Helix Operations Management with AIOps, a fully integrated, cloud-native, observability and AIOps solution designed to tackle challenging hybrid-cloud environments.

*Redis is a registered trademark of Redis Ltd. Any rights therein are reserved to Redis Ltd. Any use by BMC is for referential purposes only and does not indicate any sponsorship, endorsement or affiliation between Redis and BMC.

Monitoring Microservices with Spring Boot Actuator and AspectJ

What makes modern applications so different? Modern applications adhere to several essential architectural tenets: they are cloud-native, containerized, auto-scalable, microservice-based, and multi-tenant, and they are supported by DevOps-driven deployment automation. At BMC, our application development environment solution comprises 100-plus microservices running on a containerized cloud platform (AWS), scaled using Kubernetes.

Each microservice can be deployed as Kubernetes Pods (from three to 16 depending on horizontal scaling and replication requirements) spanning multiple availability zones (AZ). The services expose representational state transfer (REST) interfaces for external consumption and remote procedure calls (gRPC) for internal microservice-level communication. As more services are introduced, monitoring and managing modern applications becomes increasingly challenging.

Site reliability engineering (SRE) for 24×7 uptime

Adopting appropriate agile and DevOps practices is critical to successfully running enterprise-class solutions on a software-as-a-service (SaaS) model and keeping deployments highly available and responsive. BMC has adopted SRE practices to keep our solutions up and running 24×7. For self-observability of the stack, our SRE team uses our in-house monitoring solution; both SRE and development teams require a wide range of metrics to keep an eye on the health of the overall stack. Examples of data we monitor include:

  • Specific tenants in multi-tenant deployments that generate large numbers of events at particular times of day
  • The rate of publishing metrics to Kafka topics for further processing by artificial intelligence for IT operations (AIOps)
  • Log data ingestion rates and abnormally large log entries

For microservices built using the Spring Boot framework, you can enable the Spring Boot Actuator to provide many different types of metrics. For example, the Actuator exposes resource utilization metrics such as CPU and memory utilization; network and disk space utilization; and more. It also exposes other standard Java Management Extensions (JMX) metrics, including garbage collection (GC) overhead, GC time, heap utilization, thread usage/blocked threads, etc. But standard data points are not enough. Depending on the responsibility of a given microservice, you might need additional custom metrics, such as:

  • Message processing rate
  • Message drops due to communication errors
  • Message drops due to size limits
  • Average message size received
  • Communication failures with infra-services
  • Create, read, update, delete (CRUD) operations latency

Application telemetry and microservices

For both SRE and development teams, capturing telemetry is the key to seeing what’s going on inside an application. You need to build a telemetry infrastructure to capture and process the data. The two main functions of a telemetry processing infrastructure are:

  • Collecting and exposing telemetry data—BMC microservices need to be instrumented to expose standard and custom metrics
  • Processing and storing telemetry data—the BMC application development environment can ingest huge volumes of metrics data (including support for the Prometheus format) while enriching and storing it in Time Series format so it can be processed and further visualized

Below is a high-level diagram of the telemetry pipeline. For our purposes here, we will focus on how to expose standard and custom metrics for microservices built using the Spring Boot framework.

Figure. High-level view of the telemetry pipeline in the Kubernetes (K8s) cluster.

Spring Boot Actuator

The Actuator is a sub-project within the Spring Boot framework that provides production-ready features to help monitor and manage running Spring Boot applications. Telemetry data is exposed by the Actuator via endpoints, which in turn are exposed through HTTP or JMX. You can enable the Spring Boot Actuator by simply adding the spring-boot-starter-actuator dependency to your build file (Maven, in our case), as follows:

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
</dependencies>

Doing so will generate /actuator URLs with various endpoints. Here are a few we found relevant for our application:

  • /metrics is the most important endpoint (provides generic and custom metrics)
  • /prometheus returns metrics in the Prometheus format
  • /health provides information about an application’s health status
  • /heapdump returns a heap dump from the Java Virtual Machine (JVM)
  • /threaddump returns JVM thread information
  • /loggers enables fetching and updating the logging level of applications

We are specifically interested in the /actuator/metrics endpoint, which returns useful services telemetry data like jvm.gc.live.data.size, jvm.gc.max.data.size, jvm.gc.memory.allocated, jvm.memory.max, jvm.threads.live, jvm.threads.peak, process.cpu.usage, process.files.max, etc.

Since Spring Boot 2.0, metrics are managed by Micrometer, so you interact with Micrometer directly; Spring Boot auto-configures a bean of type MeterRegistry for you by default.

Custom metrics

To expose application- or service-specific telemetry, you need to define custom metrics using the Actuator framework. This involves injecting the MeterRegistry via the constructor and then using various meters to capture the telemetry. Micrometer supports various meter types (e.g., Counter, Gauge, Timer, and LongTaskTimer). It also provides a dimensional approach that lets you add tags such as customer, region, etc., to metrics to build a time series for each metric you capture. This allows for aggregation of metrics as well as drilling down as needed. The disadvantage is that the business logic inside the service gets tangled up with the telemetry code.
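A minimal sketch of that direct approach, with hypothetical metric and tag names, to show why telemetry code ends up interleaved with business logic when meters are registered inside the service itself.

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class MessageService {

    private final Counter received;
    private final Timer processingTime;

    // Spring Boot auto-configures a MeterRegistry bean, so it can be constructor-injected.
    public MessageService(MeterRegistry registry) {
        this.received = Counter.builder("gateway_message_received_count") // hypothetical metric name
                .tag("deployment_type", "saas")                           // hypothetical tags
                .tag("region", "us-east-1")
                .register(registry);
        this.processingTime = Timer.builder("gateway_message_processing_time") // hypothetical metric name
                .register(registry);
    }

    public void handle(String message) {
        received.increment();
        processingTime.record(() -> {
            // business logic mixed with telemetry calls -- the drawback noted above
        });
    }
}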

Instrumenting microservices

Instead of modifying the business logic of each microservice, we take advantage of the aspect-oriented programming (AOP) approach supported by the Spring Boot framework, which allows us to treat telemetry as a cross-cutting concern.

Figure. Instrumenting a microservice with aspects to capture telemetry.

We prefer the AspectJ approach because it allows compile-time weaving of aspects, which provides better runtime performance. The pattern we found extremely useful is to annotate the service methods that require capturing application-specific metrics, define the aspect class separately, define the pointcuts against the methods of interest, and then register the meters in MeterRegistry from inside the aspect code.

Annotate service business method: Service classes with annotation.

...
@MonitoredService
public boolean createMessage(String topic, BasicMessage basicMessage) {
    // business logic
    ...
}
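The @MonitoredService marker annotation referenced above is a custom annotation; here is a minimal sketch of how it could be declared. The retention and target details are assumptions, since the article does not show the declaration.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Marker annotation used by the pointcut below to select the methods to instrument.
@Retention(RetentionPolicy.RUNTIME) // assumed retention so @annotation() matching can see it
@Target(ElementType.METHOD)
public @interface MonitoredService {
}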

Monitoring package with Pointcuts defined: Aspect classes defining Pointcuts against methods of interest.

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.After;
import org.aspectj.lang.annotation.AfterThrowing;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Pointcut;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class GatewayServiceAspect {

    @Autowired
    private MeterRegistry meterRegistry;

    // Factory method used by AspectJ compile-time weaving so the aspect instance comes
    // from the Spring context (SpringApplicationContextHolder exposes the ApplicationContext)
    // and therefore receives the autowired MeterRegistry.
    public static GatewayServiceAspect aspectOf() {
        return SpringApplicationContextHolder.getApplicationContext()
                .getBean(GatewayServiceAspect.class);
    }

    // Pointcut for the annotated business method of the service class.
    @Pointcut("@annotation(MonitoredService)"
            + " && execution(* com.messages.EntityMessagesExecutor.createMessage(String, BasicMessage))"
            + " && args(topic, basicMessage)")
    public void metricsPointcut(String topic, BasicMessage basicMessage) {
    }

    // After the pointcut method returns, increment the counter for a specific case defined
    // by labels like the K8s pod instance, K8s deployment name, tenant/customer ID, etc.
    @After("metricsPointcut(topic, basicMessage)")
    public void recordMessageCount(JoinPoint jp, String topic, BasicMessage basicMessage) {
        Counter messagesReceived = meterRegistry.counter("gateway_message_received_count",
                "instance", "<instance-name>", "deployment_type", "<deploy-type>",
                "tenant-id", "<tenant-id>", "message_type", "<message-type>");
        messagesReceived.increment();
    }

    // Record any exception thrown by the pointcut method.
    @AfterThrowing(value = "metricsPointcut(topic, basicMessage)", throwing = "ex")
    public void catchCreateMessageException(Exception ex, String topic, BasicMessage basicMessage) {
        Counter processingErrors = meterRegistry.counter("gateway_message_exception_count",
                "instance", "<instance-name>", "deployment_type", "<deploy-type>",
                "tenant-id", "<tenant-id>", "message_type", "<message-type>");
        processingErrors.increment();
    }
}

Using this development pattern keeps a microservice's business logic clean while still providing access to the internal structures of the microservice code, enabling the capture of much deeper application metrics. The approach also eliminates the need to maintain your own collections and caches for counting and averaging, since Micrometer already does that quite reliably.

Conclusion

By leveraging a built-in and widely used telemetry framework like Spring Boot Actuator, you can expose basic Java Management Extensions (JMX) metrics and application-specific metrics easily, with minimal overhead for application developers. By combining this approach with AOP and following the patterns described above, you can keep business logic code clean while keeping the telemetry responsibility cleanly and efficiently separated. Using the Spring Boot Actuator supplemented by AspectJ lets you incrementally satisfy telemetry needs across large-scale application microservices while maintaining agility in your development processes.
