Anuj Gupta – BMC Software | Blogs

Predictive Log Alerting with ML Anomaly Detection

Anuj Gupta — Mon, 03 Apr 2023 14:22:46 +0000

Logging is vital to the success of any IT project. With a solid logging practice, you can troubleshoot errors, find patterns, calculate statistics, and provide diagnostics information easily. Given the size and complexity of many modern systems and the fact that they’re always on with 24×7 availability, logs can rapidly become difficult to manage. This, combined with aggregating logs from multiple systems, makes it infeasible to manually process logs.

Log data also contains anomalies that represent potential system faults, which makes them critical to debugging application performance and errors. However, if you look at the logs, most of the entries simply say that “an event occurred.” What we want is a way to detect when things aren’t following the normal pattern, which means that the automated analysis needs to look at individual lines and groups of entries to determine whether they’re expected or indicate any deviation. This can help you proactively find concerns before they become a problem and help troubleshoot errors when they arise.

BMC Helix Log Analytics provides automated log analysis with machine learning (ML)-based anomaly detection to process log contents and find abnormal entries and behavior patterns in logs.

Why anomaly detection is important

Imagine you log in to your system to find that an application you manage has been running slowly. Your team applied a few patches in the last release, but that was over a week ago. There’s no reason why anything should be different now. Maybe it’s an integration that’s causing problems. Or maybe the server has a hardware issue. Whatever the case, you’re going to have to look at the logs. Without a log analytics solution, you need to go through the raw log file with Ctrl+F and some regular expressions (regex). Maybe you will modify that script you tried to write last time. It didn’t quite work, but you think it sped the process up.

Keep in mind that the log has been recording a tremendous number of messages, which may take hours or even days of effort to search and troubleshoot. You don’t even know what you’re looking for. Or where to look first. The problem might not even be in the error and warning messages. It might be hidden in success messages that are fired too quickly or out of order. No amount of regex will find that.

To overcome this, you will need an automated log analytics solution that can identify the entries and behaviors that don’t look like they fit. This approach may not find the problem immediately, but it’s going to give you a subset to work with—things that you can investigate further without having to dive into those 600,000+ entries manually. That’s where BMC Helix Log Analytics and its ML-based anomaly detection capability can come to the rescue.

ML anomaly detection by BMC Helix Log Analytics

BMC Helix Log Analytics anomaly detection uses ML to detect anomalies from logs and allows you to generate events that quickly alert you to impending problems in your application or system. It incorporates an unsupervised deep-learning model which is based on an artificial neural network technique and involves the following steps:

Data pre-processing
Anomaly detection
Evaluation

The first step is data pre-processing, where the raw and unstructured log data is transformed into features that can be ingested into the anomaly detection algorithm. It parses the raw data to extract key value pairs and remove extraneous or execution-specific details, and the output is used as input for the anomaly detection stage. At this stage, the ML model looks at every incoming log record, finds patterns and behaviors from log messages, their similarity and frequency of occurrence, calculates the anomaly score, and identifies records which are anomalies. Finally, these anomalies are also cross validated with the domain expert labeled list of anomalies for each dataset to identify false positives, false negatives, true positives, and true negatives in order to derive precision and recall. This helps to auto-tune, optimize, and improve the overall accuracy of the ML anomaly detection algorithm.
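The evaluation step can be illustrated with a short Python sketch. It is not the product's internal code: the record IDs and the `precision_recall` helper are hypothetical, and the sketch only shows how precision and recall follow from comparing flagged records with an expert-labeled list. True negatives are omitted because they do not enter either formula.

```python
def precision_recall(flagged, labeled):
    """Compare model-flagged record IDs with expert-labeled anomaly IDs."""
    flagged, labeled = set(flagged), set(labeled)
    true_pos = len(flagged & labeled)    # flagged and labeled anomalous
    false_pos = len(flagged - labeled)   # flagged but labeled normal
    false_neg = len(labeled - flagged)   # labeled anomalous but missed
    precision = true_pos / (true_pos + false_pos) if flagged else 0.0
    recall = true_pos / (true_pos + false_neg) if labeled else 0.0
    return precision, recall


# Hypothetical record IDs, for illustration only.
flagged_by_model = {101, 102, 103, 104}
labeled_by_expert = {102, 103, 104, 105, 106}

precision, recall = precision_recall(flagged_by_model, labeled_by_expert)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the four flagged records are real anomalies (precision 0.75), and three of the five labeled anomalies were found (recall 0.60).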

When the log alert policy with anomaly detection is defined, the ML model is trained on the incoming logs that are categorized as training data, per the matching criteria of the anomaly alert policy, and then a threshold value is calculated. After the ML model is built, it calculates the anomaly score for every new log record that comes in, and when its value exceeds the threshold, the log record is flagged as an anomaly and an anomaly event is generated. The ML model keeps training continuously and auto-updates if it finds new patterns or behavior in logs.
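The train, threshold, and score flow described above can be sketched in Python. This is a deliberately simple stand-in, not the deep-learning model the product uses: it scores a record by how rare its pattern was in the training data. The `RarityModel` class, the sample log lines, and the choice of the highest training score as the threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def template(line):
    """Mask numbers so lines that differ only in IDs share one pattern."""
    return re.sub(r"\d+", "<n>", line)


class RarityModel:
    """Toy frequency-based scorer; NOT the product's deep-learning model."""

    def __init__(self, training_lines):
        self.counts = Counter(template(line) for line in training_lines)
        self.total = len(training_lines)
        # Assumed threshold: the highest score seen during training.
        self.threshold = max(self.score(line) for line in training_lines)

    def score(self, line):
        # Rare or unseen patterns get a high score (add-one style smoothing).
        count = self.counts.get(template(line), 0)
        return -math.log((count + 1) / (self.total + 1))

    def is_anomaly(self, line):
        return self.score(line) > self.threshold


# Hypothetical training data: one common pattern and one rare pattern.
training = [f"request {i} served in {i % 40} ms" for i in range(98)]
training += ["cache refresh took 3 s", "cache refresh took 4 s"]
model = RarityModel(training)

for line in ["request 500 served in 12 ms", "disk failure on volume 3"]:
    print(f"{line!r}: score={model.score(line):.2f} anomaly={model.is_anomaly(line)}")
```

A record whose pattern was common in training stays below the threshold, while a pattern never seen before scores above it and would be flagged.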

You can keep track of rare and anomalous log patterns and generate events using the anomaly alert policy, as shown below.

Figure 1. Log alert policy to generate anomaly events.

These anomaly log events are acted upon in the BMC Helix Operations Management events console and further correlated in the context of a given service for root cause isolation with BMC Helix AIOps capabilities. You can cross-launch into log analytics in the context of an anomaly event to see associated logs and troubleshoot for more information.

You can also visualize the anomalies in the log explorer to troubleshoot the probable cause, apply a filter based on the anomaly score, or query on specific conditions to further slice and dice log records.

Figure 2. Log explorer to analyze log anomalies.

To summarize, BMC Helix Log Analytics allows you to use ML-based anomaly detection on log files to help troubleshoot why processes are failing, identify whether you have any security concerns, and perform a check on your software. It’s best suited to large, complex systems that include access, runtime, development, and security logs and which generate tons of logs every minute. BMC Helix Log Analytics can run on any log at any time, including regularly in the background, to proactively find and solve concerns before they become problems, and increase the likelihood of finding the root cause of a problem. This also helps increase uptime, reduce errors, and improve system design, all of which are key to the success of your business.

Analyse Windows Event Logs to improve business performance

Anuj Gupta — Tue, 28 Mar 2023 11:27:54 +0000

In a perfect world, computers would function properly on the network at all times. There would be no issues with the operating system and no problems with the applications. Unfortunately, this isn’t a perfect world. System failures can and will occur, and when they do, it is the responsibility of system administrators to diagnose and resolve the issues. But where can system administrators begin the search for solutions when problems arise? The answer is Windows event logs.

What are Windows event logs?

At their core, Windows event logs are records of events that have occurred on a computer running the Windows operating system. These records contain information regarding actions that have taken place on the installed applications, the computer, and the system itself. Windows event logs include both actions taken by users and by processes executing on the computer. If there is an issue with the system, they can provide an administrator with crucial context for reaching a resolution.

Imagine for a moment that an application on your Windows machine fails, and you’re presented with an obscure error message that is relatively useless for identifying the cause of the problem. This is an example of an instance where the Windows event logs can be of great use. Event log files consist of log information that can help organizations reduce their exposure against malware, intruders, damages, and legal obligations.

The event logs are listed with header information consisting of date and time, user, computer, event ID, source, and type. The type identifies the severity of the event and is one of Information, Warning, Error, Success Audit (Security Log), or Failure Audit (Security Log). In general, Windows-based systems produce the following log types:

System: Logs regarding incidents on Windows-specific systems such as outdated hardware drivers.
Application: Logs regarding the installation of new software or hardware or currently running software.
Security: Logs regarding a Windows system’s audit policies, login attempts, and resource access.

Using BMC Helix Log Analytics

To monitor Windows event logs, enterprises need to gather, store, monitor, and manage them. This can be quite a tiresome job as log files come in various formats from different sources and in large numbers. Your network devices and servers produce thousands of system event log entries every day. Approximately 95% of your log files record entries of all events or transactions taking place in your system, such as user logins and server crashes. A manual check on every Windows device is tedious and impractical, which warrants automated auditing and monitoring of event logs on a regular basis. Further, securing the information on your network is critical to your business to protect against attempted or successful unauthorized access.

This is where the log management solution from BMC helps, by providing a centralized and easy-to-navigate user interface to collect, parse, analyze, and visualize Windows event logs end to end and generate alerts. BMC Helix Log Analytics helps you audit, monitor, and report authorized and unauthorized file access, policy changes, and any activity involving a breach of personal information such as financial data, employee details, or patient records by monitoring these event logs.

Collecting logs

The following high-level diagram shows how Windows event logs are collected and processed for analysis using BMC Helix Log Analytics.

Figure 1. Collecting Windows event logs into BMC Helix Log Analytics.

You simply need to configure a Windows event log collection policy to collect logs from a remote or local Windows event source via log connectors.

Figure 2. Log collection policy for Windows event logs.

You can specify the channels and the collection time interval. The collection policy provides an out-of-the-box parser that parses these event log records without you having to write complex regular expressions.

Figure 3. Configuring Windows event logs.

Analyzing logs

Once logs are collected, processed, and stored, you can use the log explorer to search and analyze them. You can query the logs, apply filters, and see a time-based count of log distribution.

Figure 4. Analyzing logs in Explorer.

Further, you can click on any log record to slice and dice further for more meaningful information for your operations or troubleshooting needs.

Figure 5. Detailed analysis of a Windows event log record.

By setting up thresholds via a log alert policy, you’ll be alerted if any of the user-defined events is logged and/or if the number of error events (events with “Error” or “Critical” severity levels) equals or exceeds the set value. Further, when you correlate log events in the context of service monitoring powered by BMC Helix AIOps, troubleshooting becomes easier, you get to the root cause faster, and the ITOps team can take proactive actions.
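The threshold condition can be sketched in a few lines of Python. This is only an illustration of the "equals or exceeds" rule, not how the alert policy is implemented; the event dictionaries, their `level` field, and the helper names are hypothetical.

```python
from collections import Counter

# Severity levels that count toward the alert, as named in the text.
ALERT_LEVELS = ("Error", "Critical")


def error_count(events):
    """Count events whose level is Error or Critical."""
    levels = Counter(event["level"] for event in events)
    return sum(levels[level] for level in ALERT_LEVELS)


def should_alert(events, threshold):
    """Alert when the error count equals or exceeds the threshold."""
    return error_count(events) >= threshold


# Hypothetical events collected during one evaluation window.
window = [
    {"event_id": 7036, "level": "Information"},
    {"event_id": 1000, "level": "Error"},
    {"event_id": 41, "level": "Critical"},
    {"event_id": 1001, "level": "Error"},
]

print(error_count(window), should_alert(window, threshold=3))
```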

Visualizing logs

BMC Helix Log Analytics provides an out-of-the-box Windows event logs dashboard that helps you visualize different attributes and their log distribution. You can also create a custom dashboard and add other meaningful visualizations of interest. This helps to speed up the process of investigating unusual occurrences and quickly determine whether they’re a sign of a real problem.

Figure 6. Out-of-the-box dashboard for Windows event logs monitoring.

BMC Helix Log Analytics is a centralized event log management platform for collecting and monitoring your Windows event logs for easier log analysis and issue investigation. It provides a detailed analysis of events in your infrastructure, and log alerts keep you updated on potential threats and issues in your network so you can proactively troubleshoot problems instead of waiting for them to occur. This enables higher availability and reliability for your network, reduces downtime, and increases revenue. To learn more, refer to the product documentation.

Gain Network Visibility and Performance with Syslog Monitoring

Anuj Gupta — Tue, 28 Mar 2023 11:22:37 +0000

Syslog is an event-logging standard that lets almost any device or application send data about status, events, diagnostics, and more. It’s commonly used by network and storage devices to ship observability data to log analytics platforms in order to support and secure the enterprise. These log messages contain information about the operation and status of devices, as well as any errors or issues that may have occurred. Syslog monitoring is typically used to keep track of system and network events, detect security threats, and troubleshoot problems.

Syslog, which stands for system logging protocol, has been in use since the 1980s and has become the standard for logging on many Unix-like systems. It can use User Datagram Protocol (UDP) or Transmission Control Protocol (TCP) for event delivery over the network.

Benefits of syslog monitoring

Improved security

Syslog monitoring can be used to identify and prevent security threats by detecting unusual activity or suspicious log messages or finding patterns of potential security attacks. For example, if a device on the network is attempting to access unauthorized resources or is behaving in a way that is outside of the normal range of activity, syslog monitoring can alert the administrator to this activity.

Enhanced network visibility

By collecting and analyzing log messages from all devices on the network, syslog monitoring can provide a comprehensive view of network activity and help administrators identify and resolve issues such as a sudden increase in traffic or a spike in errors.

Improved troubleshooting

Syslog monitoring can help administrators quickly locate problems, as well as identify and fix the root cause of issues. For example, if multiple devices on the network are generating similar log messages, it may be a sign that there is a problem with a shared component or configuration.

Enhanced compliance

Regulations such as the Health Insurance Portability and Accountability Act (HIPAA), Payment Card Industry Data Security Standard (PCI-DSS), and Sarbanes-Oxley Act (SOX) have specific requirements for the collection, storage, and analysis of log messages as a way to ensure the security and integrity of systems and networks. By using syslog monitoring to collect and store log messages from all devices on the network, organizations can demonstrate compliance with these requirements by providing a record of all activity on the network.

Popular use cases with syslog

Firewall monitoring

Firewall log analysis reveals a lot of information about security threat attempts at the periphery of the network and the nature of traffic coming in and going out of the firewall.

Monitoring network devices

All of your network and Internet of Things (IoT) devices generate vast amounts of data to support a number of use cases. When a device is compromised, you’ll definitely want to know what the hacker is up to, how they accessed the devices, the firmware version, and whether or not they are even operational.

Open systems logging to support operations and security

If you work in the open systems world, syslog is the most common method for getting operations and security data off that system and into your security framework.

Monitoring storage devices

Storage devices are a massive data source, especially with the growing number of security requirements being put in place, which makes storage monitoring with syslog a major security use case.

Sending alerts

Syslog alerting is beneficial in many situations where you need to be notified about events like server start up, sudden server shutdowns, broken connections, configuration reloads and failures, runtime configuration impact, resource impact, and other events. All of these alert notifications can aid in determining whether or not the servers are operational, especially when you’re responsible for hundreds of servers.

Monitoring syslog with BMC Helix Log Analytics

As an observability engineer, you need to work with syslog data in a scalable way, as this ensures that you have quality data outputs and your operations and security teams get the right data to help them do their jobs. This is where the syslog monitoring capabilities provided by BMC Helix Log Analytics help to solve your problems by providing an easy-to-navigate user interface (UI) for collecting, aggregating, analyzing, and visualizing the syslog data and sending alerts.

Collecting syslog

The data flow diagram below illustrates how syslogs are collected and configured using BMC Helix Log Analytics. The log connector collects logs from the syslog daemon server and forwards them to BMC Helix Log Analytics for further processing and storage.

Figure 1. Data ingestion from syslog sources into BMC Helix Log Analytics.

BMC Helix Log Analytics provides a log collection policy to collect syslogs from one or more connectors.

Figure 2. Log collection policy for syslog.

You can configure syslog collection by providing fields such as the bind address (default 0.0.0.0), the port (default 5140), and the transport (UDP or TCP), and parse syslogs as per RFC 3164 and RFC 5424.
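To show how the two formats differ on the wire, here is a minimal Python sketch that guesses the format of a message from its header. It is a heuristic for illustration, not the parser used by the collection policy: an RFC 5424 message carries a version number right after the `<PRI>` field, while an RFC 3164 message goes straight to the timestamp. The `detect_format` helper is hypothetical, and the sample messages follow the examples given in the two RFCs.

```python
import re

# RFC 5424: "<PRI>VERSION " followed by an ISO 8601 timestamp.
RFC5424_HEADER = re.compile(r"<\d{1,3}>\d{1,2} ")
# RFC 3164: "<PRI>" followed directly by a "Mmm dd hh:mm:ss" timestamp.
RFC3164_HEADER = re.compile(r"<\d{1,3}>")


def detect_format(message):
    """Heuristically classify a syslog message by its header."""
    if RFC5424_HEADER.match(message):
        return "RFC 5424"
    if RFC3164_HEADER.match(message):
        return "RFC 3164"
    return "unknown"


samples = [
    "<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 - 'su root' failed",
    "<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick on /dev/pts/8",
    "plain text without a priority field",
]
for message in samples:
    print(detect_format(message), "|", message[:40])
```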

Figure 3. Syslog parsing.

Analyzing syslog

The log explorer helps you search and analyze syslogs and provides quick insights into the data. You can query the logs, apply filters, and see a time-based count of log distribution.

Figure 4. Analyzing syslog in Explorer.

Next, you can click on any log record to slice and dice further for more meaningful information for your operations or troubleshooting needs.

Figure 5. Detailed analysis of syslog record.

Syslog messages contain a severity level field that can be used to report levels of emergency and warnings in case of software or hardware issues. A system restart, for example, is sent at the notice level, system reloads are delivered at the informational level, and the output of debug commands is sent at the debug level. The IT administrator can create alert policies based on these severity fields, which allow the IT operations team to take proactive actions.

To cut through the noise and focus on the key events that matter, you can search the logs by hostname, service, source, messages, and more. Further, when you correlate log events in the context of BMC Helix Service Monitoring powered by AIOps, troubleshooting becomes easier and you get to the root cause faster.
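As a small illustration of where the severity comes from, the Python sketch below decodes the priority value at the start of a syslog message. In both RFC 3164 and RFC 5424, the number in angle brackets equals the facility multiplied by 8 plus the severity, so integer division and remainder recover the two parts. The `decode_pri` helper is hypothetical and is not part of the product.

```python
import re

# Severity names in numeric order (0 to 7), as defined by the syslog RFCs.
SEVERITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]


def decode_pri(message):
    """Return (facility_number, severity_name) from a leading <PRI> field."""
    match = re.match(r"<(\d{1,3})>", message)
    if match is None:
        raise ValueError("message has no <PRI> field")
    facility, severity = divmod(int(match.group(1)), 8)
    return facility, SEVERITIES[severity]


print(decode_pri("<165>1 2003-08-24T05:14:15.000003-07:00 192.0.2.1 myproc 8710 - - restart"))
print(decode_pri("<34>Oct 11 22:14:15 mymachine su: 'su root' failed"))
```

A priority of 165 decodes to facility 20 at the notice level, and 34 decodes to facility 4 at the critical level. An alert policy keyed on severity would compare against the decoded name or number.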

Visualizing syslog

BMC Helix Log Analytics provides an out-of-the-box syslog dashboard that helps you visualize syslogs. You can also drill down from the dashboard to specific data points to speed up the process of investigating unusual occurrences and quickly determine whether they’re a sign of a real problem.

Figure 6. Out-of-the-box BMC Helix dashboard for syslog monitoring.

Syslog is a very common and widespread method for transmitting data from network devices and open systems servers. Many applications support sending data to syslog because it is a standard protocol. You can quickly audit security, monitor application behavior, and keep track of other vital server information by centralizing this data.

BMC Helix Log Analytics is purpose-built to help businesses simplify syslog management so engineers can spend more time delivering business value projects and less time moving data around the enterprise. The solution is delivered as a fully managed cloud service or on-premises with minimal setup at any scale and requires no maintenance. It monitors logs from all of your systems and applications in a centralized and easy-to-navigate user interface, allowing you to troubleshoot faster. For more information, visit the BMC Helix Log Analytics documentation page.

Archive logs to optimize storage & gain full visibility

Anuj Gupta — Tue, 11 Oct 2022 15:45:03 +0000

To gain full visibility into modern cloud environments, businesses must collect an ever-growing avalanche of log data from a range of overly complex data sources. Retaining logs is key for real-time monitoring and troubleshooting, but it can quickly become expensive at high volumes, meaning that organizations must often choose which logs to index and which to archive.

With new business requirements to log everything all the time, it can be a challenge to store and analyze all this data effectively and cost-efficiently. The proliferating number of applications doesn’t help, either. Another consideration is that the value of log data can transition from high to historical in a matter of weeks or days, which presents its own challenges when the storage cost of the data outweighs its potential value as a source of business insights.

With BMC Helix Log Analytics, you can archive and retain logs for a longer period of time at a cheaper cost. Before we dive into the details of the archival feature, let us look at some of the use cases that show us why archiving is important.

Why archiving log data is required

To meet regulatory standards—For compliance reasons, archiving your logs ensures that you are fully protected. Data retention policies can vary from several months to several years, depending on the type of service you provide and the standard or regulation with which you need to comply. For instance, section 802 of the Sarbanes-Oxley Act (SOX) requires organizations to archive their data for at least seven years.
To identify patterns and trends—Logs are essential for identifying and troubleshooting short-term problems but are less effective at identifying long-term trends. Older entries may get overwritten, deleted, or lost. Archives make it easier to identify patterns over a longer period than rolling log files.
To optimize log data storage—Archiving log data by employing compression techniques and storing archived logs in a location that does not need to be optimized for quick access are effective ways to save storage space and reduce costs. Furthermore, since the data can be decompressed and loaded into active databases any time without any data loss, it can still easily be used for on-demand troubleshooting or any other operation.
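The point about lossless compression can be checked with a small Python sketch using the standard `gzip` module. It only demonstrates the general idea; the archival format actually used by the product is not described here, and the `archive` and `restore` helpers and the sample log lines are hypothetical. The sketch assumes individual log lines contain no line breaks.

```python
import gzip


def archive(lines):
    """Join log lines and compress them into a single gzip blob."""
    return gzip.compress("\n".join(lines).encode("utf-8"))


def restore(blob):
    """Decompress a blob back into the original list of log lines."""
    return gzip.decompress(blob).decode("utf-8").split("\n")


# Hypothetical, repetitive log lines (typical logs compress well).
lines = [f"2022-10-11T15:45:{i % 60:02d}Z INFO order {i} processed" for i in range(1000)]

blob = archive(lines)
raw_size = len("\n".join(lines).encode("utf-8"))
print(f"raw={raw_size} bytes, archived={len(blob)} bytes")
print("restored without loss:", restore(blob) == lines)
```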

Perform historical analysis and investigations

Having the ability to store and analyze enormous amounts of historical log data is vital for situations that do not necessarily need immediate query responses. These include things like running security investigations across large environments, conducting audits to adhere to strict compliance frameworks, and performing long-term analytics on high cardinality datasets.

For example, when you experience a security breach or receive a report of an insider threat, your security team will need to comb through weeks, if not months, of log events to identify malicious activity. An investigation of all the activity from a suspicious IP address may require scanning petabytes of data, assessing the timeline of activity from that IP, and generating reports for other teams (e.g., legal and executive).

Similarly, businesses operating in regulated industries—such as financial services, insurance, healthcare, and aviation—have stringent requirements around servicing audit requests from among vast amounts of historical log data.

E-commerce providers, digital content makers, sports and entertainment companies, and businesses using Internet of Things (IoT) devices frequently need to perform long-term analytics on high cardinality datasets, such as users, IP addresses, device IDs, or items purchased, among others.

The log archival solution provided by BMC Helix Log Analytics addresses these use cases by retaining logs for a longer duration and restoring them for on-demand analysis, so teams do not spend valuable time spinning up new solutions, finding data loss, or worrying about query capacity and associated costs.

Log archival solution from BMC

BMC Helix Log Analytics provides an easy-to-use solution to archive logs in multi-cloud, software-as-a-service (SaaS), and on-premises platforms to help you perform historical analysis and investigations.

The following diagram illustrates the conceptual flow of log archival and restore. Logs are first saved in hot storage for a predetermined retention period. After that, they are moved to cold storage for longer-duration retention. Logs stored in cold storage archives are not available for search. To analyze these logs, you need to restore and bring them back into hot storage. Post-analysis, the logs are then auto-archived. After the archive duration, logs are purged and unavailable for search.
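The lifecycle can be expressed as a small Python function. This is a sketch under stated assumptions rather than the product's logic: the `storage_tier` helper and its parameters are hypothetical, the archive duration is assumed to start when the hot-retention period ends, and temporary restores are left out.

```python
from datetime import date


def storage_tier(log_date, today, hot_days, archive_days):
    """Return where a log sits in the hot -> cold -> purged lifecycle."""
    age_days = (today - log_date).days
    if age_days < hot_days:
        return "hot"      # searchable in the log explorer
    if age_days < hot_days + archive_days:
        return "cold"     # archived; must be restored before searching
    return "purged"       # past the archive duration, no longer available


today = date(2023, 4, 3)
for log_date in (date(2023, 3, 20), date(2023, 1, 1), date(2021, 1, 1)):
    print(log_date, storage_tier(log_date, today, hot_days=30, archive_days=365))
```

With a 30-day hot period and a 365-day archive duration, the three sample dates fall into hot, cold, and purged respectively.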

Figure 1. Conceptual overview of log archival and restore

Effortless configuration and data exploration

It is easy to configure this feature in your BMC Helix Log Analytics deployment by specifying the application logs to be archived and the duration for which they need to be archived. Once enabled, all the log data residing in different indexes will move to cold storage following the specified hot-retention period. You can select the logs index to be used for troubleshooting and restore that data.

Figure 2. Configuration to archive and restore logs

The restored data is then available for analysis in the log explorer view, where you can query, search, and perform further action. The following diagram shows how archived logs are restored and further analyzed in the log explorer.

Figure 3. Analyzing restored logs in log explorer

Once your analysis is done, you can archive the data back, or have the restored data auto-archived once the restore period is over. This capability is accessible by administrator users who manage all the archival and restore operations.

The log-archival capability of BMC Helix Log Analytics is a cost-effective, cold-storage-based solution that helps organizations retain data for historical investigation and analysis and better meet compliance and regulatory standards, while continuing to use hot storage for real-time log streaming and alerting.

To find out more about log archival and restore, check out our BMC Helix Log Analytics product documentation and watch our overview video.

AWS Cloud Observability with Log Analytics

Anuj Gupta — Mon, 10 Oct 2022 12:18:03 +0000

There are many types of logs in Amazon Web Services (AWS), and the more applications and services you run in AWS, the more complex your logging needs are bound to be. Logs originate from two primary sources: applications running on AWS services, and the AWS services themselves. An AWS centralized logging solution, therefore, becomes essential to manage this complexity. Achieving this kind of end-to-end visibility requires a conscious effort to centralize all the disparate logging data irrespective of its source. AWS provides CloudWatch to centralize this data. CloudWatch is the primary collector that collects logs from different AWS services such as Amazon VPC Flow Logs, Route 53 Logs, Lambda Logs, CloudTrail Logs, and so on, in addition to log data from applications such as Nginx or Apache that you may be using in your AWS deployment.

Once collected in CloudWatch, you can use the BMC Helix Log Analytics solution to monitor and analyze logs and set up alerts. Doing this, you can get the native benefit of AWS log collection and the analytical power of the machine learning (ML)-powered observability platform provided by BMC. The ability to collect logs from CloudWatch allows you to aggregate all your log data combined with data from other sources across hybrid and multi-cloud environments. BMC Helix Log Analytics provides advanced monitoring and alerting capabilities to derive meaningful insights from logs, such as filtering by log metadata; enriching logs to add meaningful context; customizing alert messaging; automated alerting with ML-assisted anomaly detection; and service monitoring with BMC Helix Operations Management with artificial intelligence for IT operations (AIOps).

Centralized AWS logging with BMC Helix Log Analytics

The following diagram shows logs being collected from different applications and services across different AWS regions. CloudWatch collects logs by defining log streams and log groups for each region. A log stream is a sequence of log events coming from the application instance or resource being monitored. For example, a log stream may be associated with an Apache access log on a specific host. Log groups define groups of log streams that share the same retention, monitoring, and access control settings. One or more log streams belong to a log group. If you have a separate log stream for the Apache access logs from each host, you could group those log streams into a single log group called MyWebsite.com/Apache/access_log.
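The relationship between log groups and log streams can be modeled with a small Python data structure. This is a conceptual sketch only and does not call the AWS API: the `LogGroup` class, the host names, and the 30-day retention value are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LogGroup:
    """A named group whose settings apply to every stream it contains."""
    name: str
    retention_days: int  # shared by all streams in the group
    streams: list = field(default_factory=list)

    def add_stream(self, stream_name):
        self.streams.append(stream_name)


group = LogGroup(name="MyWebsite.com/Apache/access_log", retention_days=30)
for host in ("web-01", "web-02", "web-03"):  # hypothetical hosts
    group.add_stream(f"{host}/access_log")   # one stream per host

for stream in group.streams:
    print(f"{group.name} -> {stream} (retention: {group.retention_days} days)")
```

Each host contributes its own stream, while retention is set once on the group and applies to all of them.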

Logs are aggregated in CloudWatch, and then collected and stored in BMC Helix Log Analytics for further analysis.

Figure 1. Collecting logs from AWS deployment

Collecting logs

BMC Helix Log Analytics uses a log collection policy to collect logs from AWS CloudWatch. To configure the collection policy, you must provide credentials to access the AWS account that contains the logs to be collected. Next, you would need to download and install the connector on the AWS platform to collect application or services logs for monitoring.

Figure 2. Log policy for collecting AWS logs

You may configure the refresh time and provide details of the region, log groups, and log streams for the logs to be collected.

You can also specify the application format to parse logs and provide log filters to include or exclude specific data. Log parsing helps to convert unstructured raw logs into meaningful key-value pairs and make them ready for analysis. It also enables you to get statistics on log message parameter values, conduct faceted searches, and filter logs by specific fields and values.
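As an illustration of what parsing produces, the Python sketch below turns raw Apache access log lines (Common Log Format) into key-value pairs, then uses the fields for a simple count and a filter. It is not the product's parser: the regular expression, the field names, and the `parse_line` helper are assumptions chosen for the example.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "method path protocol" status size
ACCESS_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = ACCESS_LOG.fullmatch(line.strip())
    return match.groupdict() if match else None


raw_logs = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:40 -0700] "GET /missing.html HTTP/1.0" 404 209',
    'not an access log line',
]

records = [r for r in map(parse_line, raw_logs) if r is not None]
status_counts = Counter(r["status"] for r in records)            # statistics per field value
errors = [r for r in records if r["status"].startswith(("4", "5"))]  # filter by field

print(records[0])
print(dict(status_counts))
print([r["path"] for r in errors])
```

Once each line is a dictionary, counting by status or filtering on a field becomes a one-line operation, which is what makes faceted search and field-based filters possible.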

Figure 3. Configure AWS logging

Once the configuration is done, you can see the health status of the log connector and policy configured to ensure there is no error with log collection. Then you can start to analyze logs in Explorer and create alerts and other policies as appropriate.

Monitoring logs

Once the logs are collected and stored, you can further analyze them in the log explorer to search, discover, or query any log record. Insights from logging can provide a great deal of context around the behavior of your applications and services and help you troubleshoot when you are dealing with an outage.

Figure 4. Discover and search logs in log explorer

You can slice and dice a given log record to see detailed attributes and values, which helps you understand and further troubleshoot your issue.

Figure 5. Anatomy of AWS log record

Creating alerts

When things go wrong, alerts are essential for reducing response and recovery times. BMC Helix Log Analytics offers the ability to set alarms or alerts on any given condition occurring in the logs. Log events generated from these alerts can be acted upon in BMC Helix Operations Management, which takes proactive actions and sends notifications to you.

Figure 6. Log alert policies

Figure 7. Analyzing log events in the BMC Helix Operations Management console

Visualizing logs

Dashboards help you track the most important metrics so you are always aware of the state of the system. You can create dashboards to monitor metrics derived from your logs, and visualize the data in the form of a line chart, a stacked chart, or a numerical metric. Taking things further, you can add alarms to widgets for quick and simple monitoring. BMC Helix Log Analytics uses BMC Helix Dashboards to provide out-of-the-box log dashboards and to let you create self-service dashboards as needed. Here is a self-service dashboard to monitor and visualize logs from an AWS deployment.

Figure 8. Self-service AWS log monitoring dashboard

BMC Helix Log Analytics is a great way to manage AWS observability with log monitoring. It acts as an advanced log monitoring solution to collect logs from AWS CloudWatch, perform monitoring and alerting, and deliver meaningful insights to improve and optimize your application. When used with monitoring metrics data from AWS and BMC Helix Operations Management with AIOps, it provides fully contextualized data about the state of your AWS services and the applications running on them.

The BMC Helix platform is an integral part of the BMC observability solution, giving SREs, DevOps engineers, and developers a seamless and streamlined workflow for IT monitoring, troubleshooting, and investigation to easily move from problem detection to resolution in minutes.

To find out more about log collection from AWS, check out our BMC Helix Log Analytics product documentation and video.

To learn more about BMC Helix Log Analytics capabilities, watch our overview video or refer to our product documentation.

Service Insights Powered by AI/ML

Anuj Gupta — Tue, 20 Sep 2022 09:20:52 +0000

The more systems you have, the harder it becomes to keep watch over them all. When your dynamic infrastructure includes tens of thousands of hosts, containers, and services, you can’t always anticipate where an issue will originate or what impact it will have on your organization. Take, for example, a critical order processing service that has ground to a halt. When time is at a premium, you need to troubleshoot an incident as fast as possible, and the root cause can often come from an unlikely source.

Without the right monitoring solution in place, it can be difficult to identify the specific patterns that explain why an incident is occurring. While anomaly detection, outlier detection, and composite alerting enable you to reliably alert on known issues, other incidents, such as an increase in latency or a spike in error rates in areas of your application where you haven’t set alerts, can result in significant service unavailability. Fortunately, help is on the way!

The BMC Helix AIOps Service Insights feature helps IT operations teams make sense of the overwhelming data and more precisely identify trends that are difficult to pinpoint. While root cause isolation uncovers events and anomalies associated with a service and provides root cause analysis, Service Insights fits into your existing workflows to make your investigations faster. Service Insights uses a new AI/ML-based auto-detection engine that monitors your applications automatically and continuously analyzes data.

With Service Insights, overworked service operations engineers can now identify the precise time of day and day of the week when service performance degraded. With ML pattern recognition, it is easy to see when the degradation started; for example, it may have started at the same time as a scheduled backup. Engineers can then take action, such as rescheduling the backup, and return to other business-critical projects.
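To make the time-of-day and day-of-week idea concrete, here is a minimal sketch that buckets health-score samples by weekday and hour and reports the slot with the lowest average. It is a plain averaging stand-in, not the ML engine behind Service Insights; the `worst_time_slot` function, the sample values, and the convention that a lower score means worse health are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def worst_time_slot(samples):
    """Return the (weekday, hour) slot with the lowest mean health score, plus that mean.

    `samples` is an iterable of (datetime, score) pairs; lower scores are assumed
    to mean worse service health.
    """
    buckets = defaultdict(list)
    for timestamp, score in samples:
        buckets[(WEEKDAYS[timestamp.weekday()], timestamp.hour)].append(score)
    averages = {slot: mean(scores) for slot, scores in buckets.items()}
    worst = min(averages, key=averages.get)
    return worst, averages[worst]


# Hypothetical samples: health dips on Mondays at 02:00, e.g. during a backup window.
samples = [
    (datetime(2022, 9, 5, 2, 0), 40),
    (datetime(2022, 9, 5, 14, 0), 95),
    (datetime(2022, 9, 6, 2, 0), 90),
    (datetime(2022, 9, 12, 2, 0), 50),
]
slot, average = worst_time_slot(samples)
print(slot, average)
```

A recurring low slot like this is the kind of signal that would prompt a closer look at whatever is scheduled at that time.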

The initial Service Insights feature includes visibility into the periodic performance and health of your business services.

Figure 1: Service Insights based on health score

When it detects a pattern or trend, it provides a plain-language summary of what happened, including whether the service health has improved or degraded over a given period. It also tells you the state or severity of a service, how long it has been in a critical state, and whether it needs immediate attention. Service Insights also shows you the health graph of that service so you can visualize the behavior easily. You can go back in time to discover the behavior, patterns, and trends in your data for the last 15 to 30 days.

Figure 2: Insights based on service severity

Natural Language Summary for Situations

Along with Service Insights, BMC Helix AIOps also provides Situations. A Situation comprises events associated with a service, for the same or different hosts, that are aggregated based on their occurrence, message, topology, temporal relationship, or a combination of these factors from across infrastructure, application, and network. Situations use an AI/ML-based event processing technique to identify event patterns from hundreds of raw events, filter out noisy events, and automatically group similar events together.
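The temporal side of that aggregation can be sketched in a few lines. The example below groups events for the same service when consecutive events arrive within a fixed time window. This is a deliberately simple rule-based stand-in, not the AI/ML technique that BMC Helix AIOps uses; the `group_events` function, the five-minute window, and the event fields are assumptions for illustration.

```python
from datetime import datetime, timedelta


def group_events(events, window=timedelta(minutes=5)):
    """Group events per service when consecutive events are at most `window` apart."""
    groups = []
    latest_group = {}  # service name -> most recent group for that service
    for event in sorted(events, key=lambda e: e["time"]):
        current = latest_group.get(event["service"])
        if current and event["time"] - current[-1]["time"] <= window:
            current.append(event)
        else:
            current = [event]
            groups.append(current)
            latest_group[event["service"]] = current
    return groups


# Hypothetical events from different hosts that support the same services.
events = [
    {"service": "orders", "host": "db-01", "time": datetime(2022, 9, 20, 10, 0), "message": "disk latency high"},
    {"service": "billing", "host": "app-07", "time": datetime(2022, 9, 20, 10, 1), "message": "queue depth high"},
    {"service": "orders", "host": "app-02", "time": datetime(2022, 9, 20, 10, 3), "message": "request timeouts"},
    {"service": "orders", "host": "app-02", "time": datetime(2022, 9, 20, 10, 20), "message": "request timeouts"},
]
groups = group_events(events)
print([[e["message"] for e in g] for g in groups])
```

Here the two early "orders" events from different hosts land in one group, while the later one starts a new group because it falls outside the window.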

We have added a new feature, “Situation Summary,” to BMC Helix AIOps, which gives a human-readable insight based on natural language processing to describe the problem and why it occurred. It helps the service operator or SRE easily understand the context of the situation and whether it needs immediate action, based on the underlying cause and severity of the problem.

Figure 3: Natural language summary for given situation

BMC Helix AIOps Service Insights speeds up your investigation workflows by surfacing parts of your systems and applications that you may not think to consider while exploring data. It builds on BMC’s established machine learning features, such as anomaly detection, noise reduction, and predictions, that automatically provide clues to help speed up your investigations. To find out more about Service Insights, please check out the BMC Helix AIOps 22.3 release notes and overview video.

To learn more about AIOps and how it can help your organization, be sure to listen to the BMC

“Smart AIOps: Service Modeling & Root Cause Isolation” webinar.
“Smart AIOps: Service Prediction and Intelligent Automation” webinar.

Kubernetes Observability with Logs

Anuj Gupta — Thu, 15 Sep 2022 10:15:46 +0000

Application containerization has become the norm in the IT industry. The growing adoption of microservices and distributed applications gave rise to the container revolution and necessitated orchestration tooling, like Kubernetes, which helps manage the lifecycle of hundreds of containers deployed in pods. A Kubernetes environment is highly distributed, with dynamic parts, and involves several systems with clusters and nodes that host hundreds of containers that are constantly being spun up and destroyed based on workloads.

When dealing with a large pool of containerized applications and workloads, it is important to be proactive with Kubernetes monitoring and debugging errors at the container, node, or cluster level and have an observability strategy to keep track of all the dynamic components. Such a strategy allows you to see whether your system is operating as expected, and to be alerted when it isn’t. You can then drill down for troubleshooting and incident investigation, and view trends over time. Kubernetes can also simplify the management of your containerized applications and services across different cloud services, but it does add complexity by introducing new layers and abstractions, which translates to more components and services that need to be monitored. This makes Kubernetes observability even more critical.

We already provide the ability to collect metrics from Kubernetes and monitor them with BMC Helix Operations Management. In this blog post, we will focus on Kubernetes observability with logs using BMC Helix Log Analytics. The Kubernetes logging mechanism is a crucial element to manage and monitor services and infrastructure. It allows you to track errors, monitor the health of containers that host applications, and even fine-tune the performance of containers.

Why Kubernetes logging is difficult

Application logs are a great help in understanding what’s happening inside the application. They are also handy for debugging and monitoring cluster activity. Let’s look at some of the most common challenges in Kubernetes log monitoring.

Namespace logging

When all your workloads run on shared worker virtual machines (VMs), workloads that belong to different projects are separated by namespaces. Because different projects might have their own unique logging preferences, there needs to be a way to configure these without compromising on security.

Supporting logging service level agreements (SLAs)

There’s typically only one log collection pod per Kubernetes worker node, and if this pod is rescheduled, log collection for all the other pods on that node is affected. This presents a challenge. Each node can run on the order of 100 pods, so you need to make sure your log monitoring solution can collect logs from all of them. This frequently creates a noisy environment, where one error might lead to more errors on the same worker node.

Layered logging

Kubernetes consists of clusters that have multiple layers, such as pods, nodes, and namespaces, that require monitoring. Each layer produces different types of logs, each with different characteristics and priorities. You might also find different SLAs for the same layer. With so many layers in the Kubernetes container system, logging them all together quickly becomes hard to handle.

Collecting all critical logs

If something goes wrong in your application, pods might be deleted and recreated quickly. What happens to the log file? Most likely, it will be lost as well. Failing to collect all the critical logs when something goes wrong will slow down your ability to solve the problem.

Kubernetes logging with BMC Helix Log Analytics

Managing Kubernetes logging manually can be difficult, but with BMC Helix, we take a different approach. Rather than trying to collect every log from across your pods and clusters by hand, which is a tremendously difficult task to perform at scale, you can use the BMC Helix Log Analytics Kubernetes log integration, which automatically collects logs for you, regardless of the format they are written in or where in your Kubernetes environment they’re stored. It lets you automate Kubernetes log collection and analysis, and avoid being overwhelmed by the complexity of Kubernetes logs. So, you can focus on gaining actionable visibility from those logs rather than struggling to figure out where each log is stored and how to collect it before it disappears.

BMC Helix Log Analytics automatically collects logs of all types from all components of your Kubernetes environment. It also eliminates the need for manual log aggregation. And by integrating with BMC Helix Operations Management with artificial intelligence for IT operations (AIOps), it allows you to analyze Kubernetes log data alongside metrics and other crucial sources of Kubernetes visibility to ensure that you gain full observability.

Collecting logs

The diagram below shows how logs are collected from a Kubernetes cluster using the BMC Helix Log Analytics daemon set, which automatically collects logs from different files and locations across your nodes and cluster. You can also collect logs from all pods that host services running on a node within the cluster.

Figure 1. Logs are collected from a Kubernetes cluster.

BMC Helix Log Analytics provides a Kubernetes connector to collect logs from your Kubernetes cluster deployment. You need to set up the collection configuration, download the Docker connector image, upload the image to a Docker repository, and install the connector on the cluster nodes.

Figure 2: K8s connector for collecting logs

When you configure a Kubernetes connector, you can specify the namespaces, services, and Kubernetes metadata tags. You can also specify the application format, as well as provide log filters to include or exclude specific data.

Figure 3. Configure Kubernetes meta tags, namespaces, and format

Monitoring, analyzing, and visualizing logs

Once the logs are collected, you can further analyze them in Log Explorer to search, discover, or query any log record to get more control over your logs.

Figure 4. Discover and search logs in Log Analytics Explorer

Below is a self-service dashboard to monitor and visualize your Kubernetes deployment. Use this to keep track of the health of your Kubernetes environment and the applications running on it.

Figure 5. BMC Helix Log Analytics self-service dashboard.

In essence, don’t let the complexity of Kubernetes log management prevent you from gaining true observability of your Kubernetes clusters. Use BMC Helix Log Analytics to perform the tedious work of log collection and configuration so you can focus on analyzing logs and deriving meaningful insights to improve and optimize your containerized applications. When used with BMC Helix Operations Management to monitor metrics and trace data for Kubernetes, it delivers fully contextualized data about the state of your Kubernetes cluster and the applications running in it.

BMC brings the power of the BMC Helix platform to site reliability engineers (SREs), DevOps engineers, and developers as an integral part of the BMC observability solution set. With a seamless and streamlined workflow for IT monitoring, troubleshooting, and investigation, you can easily go from problem detection to resolution in minutes.

To find out more about log collection from Kubernetes, please check out our BMC Helix Log Analytics product documentation.

To learn more about BMC Helix Log Analytics capabilities, watch our overview video or refer to our product documentation.

Make Your Data Smarter with Log Enrichment

Anuj Gupta — Thu, 17 Feb 2022 15:20:36 +0000

Logs are a key pillar of the underlying data that feeds an observability solution and artificial intelligence for IT operations (AIOps), so the data and insights derived from monitoring logs are as valuable as any other data type. Effective log analysis aids understanding of a system’s performance and health, helping IT operations (ITOps) teams and site reliability engineers (SREs) identify issues as they emerge and quickly track down the cause of failures.

While log data is desirable and helps you understand what has occurred to cause a problem, it can often be cryptic, difficult to interpret and use, contain sensitive data, or lack the relevant context, all of which makes problem analysis difficult for a business analyst.

Consider a case where an SRE or IT data analyst learns that a threat actor targeted their company’s line of business for three months, up until two weeks ago. They need to investigate whether their company data was compromised. The analyst would gather logs from multiple applications or sources, or access them from the logs repository. These could include application, firewall, network, and system logs, and more, each containing a variety of useful information for investigation.

However, the analyst cannot triage correctly without contextual information. Searching the logs by a vulnerable host’s name is not possible if the logs contain only IP addresses and no hostnames. Because volatile, dynamic IP assignments may change every day or week, resolving them after the fact leads to incorrect and misleading summary and detail information, an issue that compounds when investigating a three-month span. The only way to search effectively is to capture the hostname in real time. The further we get from the time the logs originated, the more inaccurate the information becomes. Likewise, there are many other cases where problem analysis is difficult because the underlying log data lacks the information and context needed to debug an issue.

BMC Helix Log Analytics enriches log data by adding necessary context in real time for enhanced observability and diagnosis. By enriching log data (e.g., converting IP addresses to hostnames), it makes the data more useful for search, analysis, and other operational needs. You can enrich logs by connecting to multiple enrichment sources, such as DNS, LDAP, GeoIP, and CSV, and use them to define policies.

Figure 1. Log enrichment conceptual overview

The example below illustrates how log enrichment adds meaningful context (status text) to the audit logs generated, allowing SREs and DevOps engineers to audit user transactions and logins and troubleshoot whether a user login failure is due to invalid credentials or an internal server error.

Consider a case where an application is upgraded and, as a result, a new audit log is generated that lists each user’s login status code but doesn’t tell you whether the login succeeded or failed.

Figure 2. Log data before enrichment

This log data needs to be enriched. For this example, we use a CSV file that maps each status code to its status description.

Figure 3. CSV file to enrich logs

Log enrichment is a two-step process. First, you must define and upload the enrichment source (CSV in this case) and map the source field in the raw data to the enrichment fields to be added to the log data.

Figure 4. Configuration for enrichment source

Once the CSV enrichment source has been defined, you must define an enrichment policy by providing the condition that triggers it, using the fields present in the logs. One or more enrichment sources are then associated with the policy, providing the mapping between the source field and the target enrichment fields.

Figure 5. Configuration for enrichment policy

After the enrichment policy for the audit is enabled, the logs are enriched with the “Status Text” field, which provides the audit status and meaningful context to an analyst or SRE troubleshooting the application issues. Further, the analyst may choose to create an alert and be notified whenever a user’s login status shows a failure.

Figure 6. Logs after enrichment
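The two steps described above can be approximated in a short sketch: load a CSV lookup, then apply it to records that satisfy a policy condition. The CSV contents, the field names, and the `load_lookup` and `enrich` helpers are assumptions made for illustration; the product configures all of this through its UI rather than through code like this.

```python
import csv
import io

# Hypothetical enrichment source: status codes mapped to descriptions.
STATUS_CSV = """status_code,status_text
200,Login successful
401,Invalid credentials
500,Internal server error
"""


def load_lookup(csv_text, key_field, value_field):
    """Step 1: read the CSV enrichment source into a key -> value mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_field]: row[value_field] for row in reader}


def enrich(record, lookup, source_field, target_field, condition=lambda r: True):
    """Step 2: when the policy condition holds, add the mapped value as a new field."""
    if condition(record) and record.get(source_field) in lookup:
        return {**record, target_field: lookup[record[source_field]]}
    return record


lookup = load_lookup(STATUS_CSV, "status_code", "status_text")
is_login_audit = lambda r: r.get("event") == "login"  # assumed policy condition

audit = {"event": "login", "user": "alice", "status": "401"}
enriched = enrich(audit, lookup, "status", "Status Text", condition=is_login_audit)
print(enriched)
```

With the extra field in place, an alert condition such as "Status Text is Invalid credentials" becomes a simple match on the enriched record.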

Applying log enrichment on log data alongside the other advanced capabilities of BMC Helix Log Analytics can be invaluable for managing, maintaining, and troubleshooting IT systems; identifying performance or configuration issues; and meeting operational objectives and service level agreements (SLAs).

To learn more about BMC Helix and BMC Helix Log Analytics capabilities, watch our overview video here or visit www.bmc.com/helix or our documentation site.

Observability with Logs to Accelerate MTTR

Anuj Gupta — Thu, 17 Feb 2022 13:55:16 +0000

Logs play a key role in understanding a system’s performance and health, helping IT operations (ITOps) teams and site reliability engineers (SREs) identify issues as they emerge and quickly track down the cause of failures. Log analytics involves deriving meaningful insights from log data, which then feeds into observability.

With DevOps and multicloud adoption, logging has become harder than ever. Architecture has evolved into microservices, containers, and orchestration infrastructure deployed across public and private clouds or in hybrid environments. Not only that, the sheer volume of data generated by these environments is constantly growing, which constitutes a challenge in itself. Long gone are the days when an engineer could simply use Secure Shell (SSH) to log in to a machine and grep a log file. This cannot be done in environments that have hundreds of containers generating terabytes of log data a day.

The advanced log management and analytics capabilities of BMC Helix Log Analytics can help DevOps engineers, ITOps teams, and SREs gain the visibility they need to ensure applications are always available and performing optimally.

BMC Helix Log Analytics is part of the BMC Helix Operations Management with AIOps solution, which is built on a microservices-based architecture. It is available as software as a service (SaaS) on the BMC Helix platform, integrated with other services for a seamless and unified experience, and as a container-based, on-premises deployment. It provides the following key capabilities:

Log collection
Log enrichment
Field extraction
Log analysis
Alerts and events
Root cause isolation with AIOps
Data visualization
Archive and restore

Log collection

BMC Helix Log Analytics provides log collection policies to ingest logs from different data sources or applications by leveraging open-source log connectors. It provides centralized connector management for a unified view of connectors across a distributed environment and tracks their health. Out-of-the-box log collection is available for public cloud (Amazon Web Services (AWS)), Kubernetes, Apache, syslogs, Windows event logs, and different application log files.

Figure 1. Log collection policy to collect logs.

Log enrichment

For an ITOps or DevOps engineer troubleshooting issues with logs, problem analysis can be difficult due to the lack of relevant context, which leads to an increase in the mean time to repair (MTTR). For example, if you are attempting to search the logs by a vulnerable host’s name, you may not be able to do so if the logs contain only IP addresses but no hostnames. It becomes almost impossible to reconstruct a situation because the volatile, dynamic IP data may change every hour, day, or week, leading to incorrect and misleading summary and detail information.

Log enrichment adds meaningful context to logs for enhanced observability and diagnosis. You can enrich logs by connecting to multiple different enrichment sources like DNS, LDAP, GeoIP, and CSV.

Figure 2. Logs before and after enrichment.
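The reason for enriching at ingestion time rather than at search time can be shown with a small sketch. Here a record is stamped with the hostname its IP address resolves to when the record arrives, and a later lookup of the same address returns a different machine. The `enrich_at_ingest` helper, the addresses, and the hostnames are hypothetical; in practice the resolver could be a DNS reverse lookup, but a dictionary stands in for it here so the example is deterministic.

```python
def enrich_at_ingest(record, resolve):
    """Attach the hostname that `resolve` returns for the record's IP right now."""
    hostname = resolve(record["ip"])
    return {**record, "hostname": hostname} if hostname else dict(record)


# Hypothetical dynamic address assignments that change over time.
leases = {"10.0.0.15": "build-agent-03"}

# At ingestion time the address belongs to build-agent-03, and that is what gets stored.
stored = enrich_at_ingest({"ip": "10.0.0.15", "message": "failed login"}, leases.get)

# Weeks later, the same address has been reassigned to another machine.
leases["10.0.0.15"] = "laptop-finance-12"
late_lookup = leases.get(stored["ip"])

print(stored["hostname"], late_lookup)
```

The stored record still points at the host that actually produced the log line, whereas resolving the address during the investigation would blame the wrong machine.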

Field Extraction

Often, we get messages in our logs which contain a lot of useful information but are not easily readable. BMC Helix Log Analytics provides field extraction to allow you to tokenize and extract relevant fields from log messages at the time logs are ingested.

Extracted fields are then used in the log explorer to search, filter, and query logs. They can also be used in alert or enrichment policies, to create visualizations and add them to a dashboard, and for other advanced diagnostics and troubleshooting.
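As a minimal sketch of what tokenizing a message into fields can look like, the example below pulls `key=value` pairs out of a free-text message. The message layout, the pattern, and the `extract_fields` helper are assumptions for illustration and do not reflect how the product defines extraction rules.

```python
import re

# Matches key=value pairs; values may be bare tokens or double-quoted strings.
KV_PATTERN = re.compile(r'(\w+)=("[^"]*"|\S+)')


def extract_fields(message):
    """Return a dict of the key=value pairs found in a log message."""
    return {key: value.strip('"') for key, value in KV_PATTERN.findall(message)}


message = 'user=alice action=login status=401 detail="bad password" latency_ms=87'
fields = extract_fields(message)
print(fields)
```

Each extracted key then behaves like any other field, so a search such as `status` equal to `401` no longer needs a full-text scan of the message.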

Figure 3. Log record before field extraction.

Figure 4. Log record with field extraction.

Log analysis

The log explorer helps you discover and gain quick insights into your data by searching and filtering it to get information about the structure of the fields or for a given point in time. It can also create a visualization or save searches and present the findings in a dashboard.

Figure 5. Discover and search logs in log explorer.

Alerts and events

Alerts can detect issues quickly without you having to continuously monitor the dashboard. Alerts can be created for complex occurrences between many applications, which allows the ITOps team to take proactive action for the specific, tangible events that are generated.

You can also create an alert by using alert policies and defining the thresholds on the given fields and error conditions.
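A threshold-style alert policy can be pictured as a count compared against a limit. The sketch below raises an alert when enough records in a batch match a field condition; the `evaluate_alert` function, the field names, and the threshold are hypothetical and only illustrate the idea, not the product's policy engine.

```python
def evaluate_alert(records, field, value, threshold):
    """Return an alert dict when at least `threshold` records have record[field] == value."""
    count = sum(1 for record in records if record.get(field) == value)
    if count >= threshold:
        return {"policy": f"{field}={value}", "count": count, "threshold": threshold}
    return None


# Hypothetical batch of parsed log records from one evaluation interval.
batch = [
    {"service": "orders", "level": "ERROR"},
    {"service": "orders", "level": "INFO"},
    {"service": "billing", "level": "ERROR"},
    {"service": "orders", "level": "ERROR"},
]

alert = evaluate_alert(batch, "level", "ERROR", threshold=3)
print(alert)
```

Evaluating the same policy on each interval's batch is what removes the need to watch a dashboard continuously.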

Figure 6. Alert-configuration.

While managing and analyzing log events, users can perform multiple actions, including notifying the end user via email. All log events are managed in the BMC Helix Operations Management portal, and a user can cross-launch into BMC Helix Log Analytics to see the logs associated with a log event.

Figure 7. Analysing Log events in BHOM.

Root Cause Isolation with AIOps

If you are using BMC Helix Service Monitoring, a log event is automatically correlated with other contextual events for a service to provide root cause isolation and pinpoint the causal node. You can then click on the log event and cross-launch into BMC Helix Log Analytics to see the associated contextual logs and diagnose the issue.

Figure 8. Root Cause Isolation using log events.

Log events are also part of Situations formed on services, and if a log event is the root cause event, you can click on it to see the associated logs.

Figure 9. Situations using log events.

Data visualization

You can represent log data graphically by using BMC Helix dashboards to derive valuable insights, analyze issues, and identify trends. All data is stored centrally, so it can be plotted across multiple sources to run cross-analyses and identify correlations. There are many out-of-the-box dashboards available for log monitoring like AWS, Kubernetes, syslog, Windows event logs, and more. Users can also create a custom dashboard by adding visualizations of interest.

BMC Helix dashboards provide various options to run queries and apply filters to dashboards so users can interrogate their data. You can also drill down from the dashboard to specific data points to speed up the process of investigating unusual occurrences and quickly determine whether they’re a sign of a real problem.

Figure 10. Monitoring using logs dashboard.

Archive and Restore

BMC Helix Log Analytics provides real-time storage and access for 30 days of raw log data; cold storage to retain logs for longer durations; and an option to restore data on demand for search and analysis. The archival option enables critical logs to be retained for even greater durations, which may be useful for audit, compliance, and other operational requirements.

Figure 11. Log archive and restore.

To conclude, BMC Helix Log Analytics provides a wealth of insights into the usage, health, and performance of your systems, together with a powerful and efficient set of integrated capabilities for detecting and troubleshooting issues. Not only does it simplify and accelerate the process of collating, normalizing, and parsing your log data to make it available for analysis, but it also provides advanced artificial intelligence and machine learning (AI/ML) capabilities for noise reduction and root cause isolation with BMC Helix Service Monitoring powered by AIOps.

BMC Helix Log Analytics leverages ML to keep pace with your systems and data as they evolve and ensures that you get the maximum value from your logs. This in turn helps to free up your ITOps and SRE teams to focus on investigating true positives and making targeted improvements to their platform and infrastructure.

To learn more about BMC Helix and BMC Helix Log Analytics capabilities, watch our overview video here or visit www.bmc.com/helix or our documentation site.
