Machine Learning & Big Data Blog – BMC Software | Blogs

Streamlining Machine Learning Workflows with Control-M and Amazon SageMaker

In today’s fast-paced digital landscape, the ability to harness the power of artificial intelligence (AI) and machine learning (ML) is crucial for businesses aiming to gain a competitive edge. Amazon SageMaker is a game-changing ML platform that empowers businesses and data scientists to seamlessly navigate the development of complex AI models. One of its standout features is its end-to-end ML pipeline, which streamlines the entire process from data preparation to model deployment. Amazon SageMaker’s integrated Jupyter Notebook platform enables collaborative and interactive model development, while its data labeling service simplifies the often-labor-intensive task of data annotation.

It also boasts an extensive library of pre-built algorithms and deep learning frameworks, making it accessible to both newcomers and experienced ML practitioners. Amazon SageMaker’s managed training and inference capabilities provide the scalability and elasticity needed for real-world AI deployments. Moreover, its automatic model tuning and robust monitoring tools enhance the efficiency and reliability of AI models, ensuring they remain accurate and up-to-date over time. Overall, Amazon SageMaker offers a comprehensive, scalable, and user-friendly ML environment, making it a top choice for organizations looking to leverage the potential of AI.

Bringing Amazon SageMaker and Control-M together

Amazon SageMaker simplifies the entire ML workflow, making it accessible to a broader range of users, including data scientists and developers. It provides a unified platform for building, training, and deploying ML models. However, to truly harness the power of Amazon SageMaker, businesses often require the ability to orchestrate and automate ML workflows and integrate them seamlessly with other business processes. This is where Control-M from BMC comes into play.

Control-M is a versatile application and data workflow orchestration platform that allows organizations to automate, monitor, and manage their data and AI-related processes efficiently. It can seamlessly integrate with SageMaker to create a bridge between AI modeling and deployment and business operations.

In this blog, we’ll explore the seamless integration between Amazon SageMaker and Control-M and the transformative impact it can have on businesses.

Amazon SageMaker empowers data scientists and developers to create, train, and deploy ML models across various environments—on-premises, in the cloud, or on edge devices. An end-to-end data pipeline typically includes more than Amazon SageMaker’s AI and ML functionality: data is ingested from multiple sources, transformed, and aggregated before a model is trained and AI/ML pipelines are executed with Amazon SageMaker. Control-M is often used for automating and orchestrating such end-to-end data pipelines. A good example of end-to-end orchestration is covered in the blog, “Orchestrating a Predictive Maintenance Data Pipeline,” co-authored by Amazon Web Services (AWS) and BMC.

Here, we will specifically focus on integrating Amazon SageMaker with Control-M. When you have Amazon SageMaker jobs embedded in your data pipeline or complex workflow orchestrated by Control-M, you can harness the capabilities of Control-M for Amazon SageMaker to efficiently execute an end-to-end data pipeline that also includes Amazon SageMaker pipelines.

Key capabilities

Control-M for Amazon SageMaker provides:

  • Secure connectivity: Connect to any Amazon SageMaker endpoint securely, eliminating the need to provide authentication details explicitly
  • Unified scheduling: Integrate Amazon SageMaker jobs seamlessly with other Control-M jobs within a single scheduling environment, streamlining your workflow management
  • Pipeline execution: Execute Amazon SageMaker pipelines effortlessly, ensuring that your ML workflows run smoothly
  • Monitoring and SLA management: Keep a close eye on the status, results, and output of Amazon SageMaker jobs within the Control-M Monitoring domain and attach service level agreement (SLA) jobs to your Amazon SageMaker jobs for precise control
  • Advanced capabilities: Leverage all Control-M capabilities, including advanced scheduling criteria, complex dependencies, resource pools, lock resources, and variables to orchestrate your ML workflows effectively
  • Parallel execution: Run up to 50 Amazon SageMaker jobs simultaneously per agent, allowing for efficient job execution at scale

Control-M for Amazon SageMaker compatibility

Before diving into how to set up Control-M for Amazon SageMaker, it’s essential to ensure that your environment meets the compatibility requirements:

  • Control-M/EM: version 9.0.20.200 or higher
  • Control-M/Agent: version 9.0.20.200 or higher
  • Control-M Application Integrator: version 9.0.20.200 or higher
  • Control-M Web: version 9.0.20.200 or higher
  • Control-M Automation API: version 9.0.20.250 or higher

Please ensure you have the required installation files for each prerequisite available.

A real-world example

The Abalone Dataset, sourced from the UCI Machine Learning Repository, has been frequently used in ML examples and tutorials to predict the age of abalones based on various attributes such as size, weight, and gender. The age of abalones is usually determined through a physical examination of their shells, which can be both tedious and intrusive. However, with ML, we can predict the age with considerable accuracy without resorting to physical examinations.

For this exercise, we used the Abalone tutorial provided by AWS. This tutorial efficiently walks users through the stages of data preprocessing, training, and model evaluation using Amazon SageMaker.

After understanding the tutorial’s nuances, we trained the Amazon SageMaker model with the Abalone Dataset, achieving satisfactory accuracy. Further, we created a comprehensive continuous integration and continuous delivery (CI/CD) pipeline that automates model retraining and endpoint updates. This not only streamlined the model deployment process but also ensured that the Amazon SageMaker endpoint for inference was always up-to-date with the latest trained model.

Setting up Control-M for Amazon SageMaker

Now, let’s walk through how to set up Control-M for Amazon SageMaker, which has three main steps:

  1. Creating a connection profile that Control-M will use to connect to the Amazon SageMaker environment
  2. Defining an Amazon SageMaker job in Control-M that will define what we want to run and monitor within Amazon SageMaker
  3. Executing an Amazon SageMaker pipeline with Control-M

Step 1: Create a connection profile

To begin, you need to define a connection profile for Amazon SageMaker, which contains the necessary parameters for authentication and communication with SageMaker. Two authentication methods are commonly used, depending on your setup.

Example 1: Authentication with AWS access key and secret

Figure 1. Authentication with AWS access key and secret.

Example 2: Authentication with AWS IAM role from EC2 instance

Figure 2. Authentication with AWS IAM role.

Choose the authentication method that aligns with your environment. It is important to specify the Amazon SageMaker job type exactly as shown in the examples above. Please note that Amazon SageMaker is case-sensitive, so make sure to use the correct capitalization.
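
For reference, the two methods map onto the standard AWS credential options. The hedged Python sketch below (using boto3, outside of Control-M) illustrates the difference; the key values and region shown are placeholders, and in practice Control-M keeps these details in the connection profile so that job definitions never contain credentials.

```python
import boto3

# Option 1 (conceptual): explicit access key and secret, the same details the
# access-key connection profile stores. The values below are placeholders.
key_session = boto3.Session(
    aws_access_key_id="AKIAEXAMPLEKEYID",           # placeholder
    aws_secret_access_key="exampleSecretAccessKey",  # placeholder
    region_name="us-east-1",                         # illustrative region
)

# Option 2 (conceptual): no credentials supplied. boto3 falls back to the IAM
# role attached to the EC2 instance, which is what the IAM-role connection
# profile relies on when the Control-M/Agent runs on EC2.
role_session = boto3.Session(region_name="us-east-1")

# Either session can talk to SageMaker once the credentials resolve.
sagemaker = role_session.client("sagemaker")
print([p["PipelineName"] for p in sagemaker.list_pipelines()["PipelineSummaries"]])
```

Whichever option you choose, the credentials live in the connection profile rather than in individual job definitions, which is what allows Control-M jobs to connect securely without authentication details being provided explicitly.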

Step 2: Define an Amazon SageMaker job

Once you’ve set up the connection profile, you can define an Amazon SageMaker job within Control-M. This job type enables you to execute Amazon SageMaker pipelines effectively.

Figure 3. Example AWS SageMaker job definition.

In this example, we’ve defined an Amazon SageMaker job, specifying the connection profile to be used (“AWS-SAGEMAKER”). You can configure additional parameters such as the pipeline name, idempotency token, parameters to pass to the job, retry settings, and more. For a detailed understanding and code snippets, please refer to the BMC official documentation for Amazon SageMaker.

Step 3: Executing the Amazon SageMaker pipeline with Control-M

It’s essential to note that the pipeline name and endpoint are mandatory JSON objects within the pipeline configuration. Executing the “ctm run” command on the pipeline.json file activates the pipeline’s execution within AWS.

First, we run “ctm build sagemakerjob.json” to validate our JSON configuration and then the “ctm run sagemakerjob.json” command to execute the pipeline.
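
For context, the Control-M job ultimately drives the same SageMaker pipeline API that you could call directly. The boto3 sketch below is a rough, hypothetical equivalent of what happens on the AWS side when the job runs; the pipeline name, parameter, and region values are illustrative stand-ins for whatever the job definition supplies.

```python
import uuid
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")  # illustrative region

# Start the pipeline execution. "AbalonePipeline" and the parameter below are
# hypothetical values standing in for what the Control-M job definition passes.
response = sagemaker.start_pipeline_execution(
    PipelineName="AbalonePipeline",
    PipelineParameters=[
        {"Name": "ModelApprovalStatus", "Value": "PendingManualApproval"},
    ],
    ClientRequestToken=str(uuid.uuid4()),  # idempotency token, as in the job definition
)

# Check the execution status -- roughly the information Control-M surfaces in
# the Monitoring domain while the job runs.
status = sagemaker.describe_pipeline_execution(
    PipelineExecutionArn=response["PipelineExecutionArn"]
)["PipelineExecutionStatus"]
print(status)  # Executing | Succeeded | Failed | Stopping | Stopped
```

In practice, you let Control-M make this call for you so that the execution participates in the broader workflow, SLA management, and monitoring described earlier.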

Figure 4. Launching Amazon SageMaker job.

As seen in the screenshot above, the “ctm run” command has launched the Amazon SageMaker job. The next screenshot shows the pipeline running from the Amazon SageMaker console.

Figure 5. View of data pipeline running in Amazon SageMaker console.

In the Control-M monitoring domain, users have the ability to view job outputs. This allows for easy tracking of pipeline statuses and provides insights for troubleshooting any job failures.

Figure 6. View of Amazon SageMaker job output from Control-M Monitoring domain.

Summary

In this blog, we demonstrated how to integrate Control-M with Amazon SageMaker to unlock the full potential of AWS ML services, orchestrating them effortlessly into your existing application and data workflows. This fusion not only eases the management of ML jobs but also optimizes your overall automation processes.

Stay tuned for more blogs on Control-M and BMC Helix Control-M integrations! To learn more about Control-M integrations, visit our website.

The Essential Role Orchestration Played in Brazilian Retailer Marisa’s Digital Transformation

At Marisa, we are proud of our heritage of being a leading Brazilian retailer for 75 years. Much has changed since our founding, and one of the biggest recent changes happened within our IT systems. Our team was hired by the CIO to lead a major transformation project to break down silos and integrate data from all aspects of the business so we could better connect with our customers.

To get the new business insights we required, we knew we would need new tools. SAP® would remain our system of record, but beyond that we were not committed to keeping many elements of our current-generation system, which included Control-M. Our team had the freedom to completely modernize and select the software and other tools that would best fit our goals and future-proof our business. We used that flexibility to create an infrastructure designed to take advantage of the powerful data, analytics, and reporting capabilities available today.

The result is a hybrid cloud environment with data being used simultaneously for analytics and other functions in multiple locations. Some of the key elements include:

  • SAP Business Warehouse (SAP BW)
  • Informatica
  • Data Lake on Amazon Web Services (AWS)
  • Azure Databricks
  • SQL Server
  • Amazon Redshift
  • Power BI
  • MicroStrategy
  • Airflow
  • Control-M

Yes, Control-M is on the list, despite our initial thought that we would have to replace it because so many workloads and systems were being updated or replaced.

After we identified many of the elements that would be essential in our responsive new architecture, we began to focus on how we could integrate and orchestrate it all. The complexity became frustrating as we learned about the limitations that each component had for integrating its workflows with others. We had counted on Airflow to solve those challenges, but it had its own limitations. That was the point where we realized Control-M was not part of the problem with our IT systems, it was an essential part of the solution.

Our modernization was driven by the principle of bringing together data from more sources, using best-of-breed solutions. We saw that the limitations of domain-specific tools would be a barrier to realizing our vision and getting the most complete insights possible. We then took a closer look and realized that Control-M was capable of doing much more than what we had been using it for. That includes its many integrations with modern data and cloud technologies, so our staff could continue to work with their preferred tools, while allowing Control-M to orchestrate all the operations.

The daily executive report we produce is an excellent example of how everything comes together. Known in the corporate offices as “The Newspaper,” the report consists of a series of dashboards with data and visualizations that show all the leading business indicators and developments from the previous day. It shows daily sales by department and channel (physical stores, e-commerce) plus average receipts, margins, inventory levels, Net Promoter Score (NPS), supply chain updates, and much more. Like a real newspaper, the report relies on information from hundreds of sources and must be produced within strict service level agreements (SLAs). After stores close, we have seven hours to gather, process, and assimilate this data and deliver it to executives before their workday begins the next morning.

Various structured and unstructured data from point of sale (POS), customer relationship management (CRM), inventory, shipping, HR, and other systems is loaded into our SAP Business Warehouse. We use the data lake to produce 18 different reports that are customizable to different business operations and individuals. The process involves our enterprise resource planning (ERP) and all the other systems previously referenced.

Control-M plays the crucial role of being the overall orchestrator. Just for the file transfers to the data lake, Control-M executes 92 complex workflows that require integrations with 12 separate systems. Control-M’s out-of-the-box integrations with Amazon S3, Azure Databricks, Informatica Cloud, and SAP have been integral, as have its connection profiles, which allow us to easily build integrations to other environments. We take advantage of the integration with Airflow to orchestrate our data pipelines, enabling our development and operations teams to use the tools they know best, with Control-M handling the orchestration. Control-M is highly scalable and ensures Airflow jobs run reliably in production.

Control-M doesn’t only connect all the pieces in our new environment; it also continually monitors the workflows running across them to ensure we have no interruptions. We recently created a centralized enterprise monitoring center with integration between Control-M and our ITSM system at the core. As part of that process, we used Control-M to consolidate activities, thereby eliminating more than 200 recurring jobs. Control-M SLA Management proactively identifies potential workflow execution delays and initiates fixes or notifications. We built a feature that automatically issues a notification via WhatsApp to the appropriate business and operations staff if there is a potential issue with their critical jobs. Our environment is much more complex than it used to be, but we are more responsive and data-driven than ever.

These are some of the successes we’ve achieved in the first year of our transformation program. There’s much more we can do, and now we know Control-M will continue to support us as our systems modernize and our business evolves.

For more information on Control-M, visit bmc.com/controlm.

Generating Real Value from Data Requires Real Investment

Scan any business or tech headline right now, and you’re likely to see artificial intelligence (AI) and machine learning (ML), and more specifically the rising niches of GPT and LLMs (generative pre-trained transformer and large language models). GPT and LLMs distill data and return content in natural language, whether as longform narrative, auto-populated answers to questions, or even imagery or videos, all at super-fast speeds. While there’s still much to sort through on what these technologies mean for business, tech, politics, ethics, and more, one thing is clear—they’re breaking new ground for data.

AI, GPT, and LLMs live and die by data. They analyze it, learn from it, and create it, both leveraging and adding to its already explosive growth. And right now, businesses are generating, accumulating, and retaining mountains of it—and spending a considerable amount of money to do so. But to what end?

According to IDC, “Despite spending $290 billion globally on technology and services to increase decision velocity, 42 percent of enterprises report that data is underutilized in their organizations,” and a recent IDC Global DataSphere forecast predicts that by 2026, seven petabytes of data will be created every second. Boston Consulting Group says the comprehensive costs around data are equally staggering, as “spending on data-related software, services, and hardware—which already amounts to about half a trillion dollars globally—is expected to double in the next five years.”

It’s time to put all that juicy data you’ve collected to work, and investing in AI technologies can help you get there. While GPT and LLM solutions are gaining a reputation for what they can create, they’re also being put to work in DataOps practices and analytics solutions that can help you make sense of all that data in the first place. Today’s data is so complex that organizations cannot unravel it without the power of AI.

As I covered in my previous blog, DataOps is all about getting your arms around your data by improving data quality, gaining better business insights, and expanding innovation and cloud efficiency. AI and AI-derived technologies can help on all three fronts.

AI can be used to collate, contextualize, and analyze your hard-won proprietary data and then help you use it to learn about your business and your customers. With AI combing through data, you can uncover new insights that were previously inconceivable even a few years ago—and make informed decisions about which data is no longer needed, still missing, needs more details, and so on. From there, that data can be used to train GPT and LLM tools that advance and expand your business and become the targeted solutions and services your customers crave.

The Eckerson Group recently polled data practitioners on LinkedIn and discovered that 43 percent already use LLMs to assist data engineering. In a second poll, 54 percent said they use ChatGPT to help write documentation, 18 percent use it for designing and building pipelines, and another 18 percent are using it to learn new techniques.

Sitting on a mountain of data gets you nowhere if you don’t know what’s in it. With data accumulations surpassing our capacity to sort through, understand, quantify, and qualify it, investing in AI/ML technologies is the way forward. These technologies can help you dig into all that data and yield valuable insights to better understand your business, discover where to expand or change course, identify new opportunities, and ultimately deliver the solutions your customers and stakeholders want.

Making the most of GPT and LLMs relies on a solid data management foundation enabled by the people, process, and technology shifts of a DataOps strategy and methodology. Learn more about how organizations are yielding value from data in Profitable Outcomes Linked to Data-Driven Maturity, a BMC-commissioned study by 451 Research, part of S&P Global Market Intelligence.

Taking Steps to Unify Data for Maximum Value

Businesses have been on a data collection kick for a while now, and it’s no surprise since IDC says we’ll generate around 221 zettabytes of data by 2026. But if your goal is to turn all that data into insights, where do you start? Do you know what you have? Is it the right data? And, most importantly, is it yielding value for your business?

We commissioned 451 Research, part of S&P Global Market Intelligence, to survey 1,100 IT and data professionals from diverse global regions about what they want from their data, and the challenges they’re facing in achieving those goals. The findings are out now in Profitable Outcomes Linked to Data-Driven Maturity.

The survey revealed a handful of common issues that are impeding progress as businesses try to gather and present a unified view of their data. Among them:

  • Meeting the streaming or real-time requirements needed to support data collection from 24×7 business models and Internet of Things devices
  • Lack of automation, and a reliance on manual processes and legacy solutions
  • Data quality issues with collecting inaccurate and out-of-date information
  • Data silos and lack of system interoperability

Additionally, respondents said they need help determining the usability, trustworthiness, and quality of the information they’ve been gathering—and continue to gather—to maximize and optimize that data. If the data is incomplete or incorrect, an organization loses not only the time and effort required to gather and store it in the first place—it also puts itself at risk of noncompliance issues and strategic missteps that damage the bottom line.

Ensuring that you’re gathering the right data, and putting it to good use, requires a tool that can deliver a unified view. Automated capabilities are key to saving time and toil related to data processing, reducing errors, and delivering real-time visibility anytime from anywhere. BMC’s application workflow orchestration solutions, Control-M and BMC Helix Control-M, can help organizations optimize the data they’ve worked so hard to collect, and yield the most value from it.

Control-M simplifies application and data workflow orchestration on-premises or as a service. It makes it easy to build, define, schedule, manage, and monitor production workflows, ensuring visibility and reliability and improving service level agreements (SLAs). BMC Helix Control-M is a software-as-a-service (SaaS)-based solution that integrates, automates, and orchestrates complex data and application workflows across highly heterogeneous technology environments.

Both solutions support the implementation of DataOps, which applies agile engineering and DevOps best practices to the field of data management to better organize, analyze, and leverage data and unlock business value. With DataOps, DevOps teams, data engineers, data scientists, and analytics teams collaborate to collect and implement data-driven business insights.

Automating and orchestrating data pipelines with tools like Control-M and BMC Helix Control-M is integral to DataOps, and can help you yield value from your data and drive better business outcomes by:

  • Improving data quality: Once guardrails are in place to identify, collate, and analyze data, you’ll get a better sense of the data you have—and what you still need.
  • Gaining better business insights: Now that you’re collecting and analyzing the data you want—and not cluttering it with the data you don’t—it’s an easier task to leverage that information for targeted, revenue-generating activities.
  • Expanding innovation and cloud efficiency: With the cost savings achieved through data orchestration and better data processes, you can redirect spend toward innovation initiatives (informed by those very same data insights) that help grow the business.

You can read the full report, Profitable Outcomes Linked to Data-Driven Maturity, here. Visit bmc.com/controlm to learn more about Control-M and bmc.com/helixcontrolm to learn about BMC Helix Control-M.

Formula One’s Mark Gallagher Talks Data and Insights

In 2022, 5.7 million people attended Formula One races around the world, with revenue growing to $2.573 billion. Those two nuggets of intel about one of the biggest sports in the world are data points, and in our latest BMC Transformational Speaker Series, BMC VP of Sales Jeff Hardy and Oracle Senior Director of ISV Success Dan Grant welcomed Formula One Racing Data Analyst Mark Gallagher for a wide-ranging discussion on how data, analytics, and insights are being used to improve efficiency, safety, and more for the organization’s drivers and vehicles. Here are a few excerpts from the conversation.

Mark shared that the organization’s technology evolution since 1950 has been iterative. “We started off by learning how to make cars go faster. We then embraced aerodynamics and learned how to make aircraft go faster, which is effectively what a Formula One car is today. It’s an inverted jet fighter,” he says. “And really the third suite of tools have been digital. And it’s extraordinary to really reflect on the fact that Formula One’s digital transformation has been taking place for more than half of its history.”

“Now to this day, all teams and particularly the more competitive teams [are] utilizing data and analytics. Formula One’s all about action. And we want insights. We want to go on a journey of knowledge rather than a journey of hope. We don’t want to hope we win. We want to know we’re going to win. And that’s where the actionable insights come from.”

While initial analytics revealed what was going right, Mark says he and his team wanted more, and better, data and insights. “We suddenly started wanting a deeper dive. What’s going to help us go faster? What’s going to help us manage risks better? What’s going to enable us to prevent negative outcomes and drive positive outcomes,” he explains. “And that there is the analytics space. Race car drivers haven’t changed very much over the years, but the ecosystem within which they’re operating is night and day difference, thanks to our data-driven environment.”

Mark acknowledged Formula One has had its share of negative outcomes over its history, with fatal accidents once occurring on the track every year, or even multiple times in a single year, and says the advances they have made through technology mean that younger drivers have never experienced that. “One of the really big changes over the last quarter of a century has been the improvement in our risk management, our ability to use real-time data to spot trends, to analyze failure modes developing, to look at diagnostics in real time and say, ‘Actually, there’s a problem developing,’” he points out.

“When you look at a lot of accidents, they’re caused by a failure, component failure that’s caused by a particular issue arising. In many cases … back in the day, we couldn’t do anything about [that]. But now we can. We can instantaneously, if necessary, call a halt to operations. That doesn’t happen very often, but if we need to, we can call a halt to the whole operation.”

“We can manage the lifecycle of up to 80,000 components that we’re going through on the car through the year. And that means that every single aspect and in total, granular detail is being managed so effectively to ensure that we get optimized performance and that risks are minimized. So, when people ask what’s been the big change for me, there is an enormous change.”

Mark adds that airlines have done the same thing. “We’re not the first industry to do that. Every time we get in an aircraft, we’re getting on board something that is inherently safe because of the culture of examining data, forensically examining data from past events in order to ensure that future outcomes are possible,” he explains. “So again, Formula One has looked at aviation and aerospace and said, ‘That’s the level of engineering we’re going to move to.’ And data-driven tools have been integral to that evolution.”

Mark says that between 1950 and 2000, about 45 percent of the time, Formula One cars simply failed due to mechanical issues, and that’s also now become a thing of the past. “Considering that we pride ourselves as engineering companies, we actually weren’t very good at building robust, reliable technology. Today, Formula One world champions can realistically expect to go through a whole season without suffering a single mechanical or technical failure,” he shares.

“I think Lewis Hamilton … went four and a half years without a single significant technical failure. I’ve never had a road car that’s lasted four and a half years with robust technology. So, in Formula One, that quality, that reliability, the foundation stones of the quality of our engineering and outcomes, that has been made possible by our data-driven environment. We’re no longer hoping everything’s going to be okay. We know what the outcome can be.”

If you’re a fan of movies about race cars, you’re familiar with the big moments where a trainer or a coach clicks the stopwatch to marvel at how fast the car got around the track. Mark says they’re light years beyond that. “When we look at the metrics that we are interested in, it all started with the humble stopwatch, [and us wondering], ‘How can we get from A to B faster than our competitors?’” he says.

“How do we get from where we are now to where we want to be more efficiently than our competition? [Now], we are looking at thousands of parameters. We’re talking about 300 sensors on the car, maybe … 1,200 channels of data. The cars are generating … about ten gigabytes of data per lap and a couple of terabytes of data over the weekend from the car. And in terms of the KPIs …we’ve got a ton of people looking at all the metrics. We know the pressures of the tires, we know the temperatures of the tires, we know everything that the human being driving the car is doing.”

“Most of what you’re looking at is all doing fine, but we are really interested in the opportunities, and this is where the actionable insights kick in, because [when] you get an anomaly … it’s amazing what we can do. And in its most extreme form, let’s say you were leading the race and you had three laps of the race remaining, and an issue develops on a particular system, we can monitor that system. We can talk to the driver and say, ‘Can you modify your driving because this system is beginning to fail,’ or ‘There’s an issue with this system.’ We might even tell the driver to switch off a particular system. We’ve even had occasions where we’ve had Formula One drivers do the equivalent of control, alt, delete on their steering wheel and literally reset everything and it’s cured a problem.”

“In terms of the metrics that we’re interested in, it’s anything that’s going to show us our benchmark performance against the competition and where the opportunities and the anomalies … and the risks are that we can really dig into some detail that’s going to help us to improve performance.”

“This is why when you see Formula One race car drivers being interviewed after a qualifying session or after a race, they very often say, ‘We’ve got to look at the data.’ And what they’re actually saying is, ‘We’ve got to look at where that issue lies [or] where that opportunity lies.’ It’ll always be the thing or the items which are going to lead to us getting a performance improvement. So, our data-driven environment from the driver’s perspective, is all about managing and gaining insights into any metrics that are going to supercharge our continuous improvement as a race team.”

BMC is keenly interested in the world of data and analytics, and helping companies pursue and improve their data strategies. We recently commissioned 451 Research, part of S&P Global Market Intelligence, to survey 1,100 IT and data professionals from diverse global regions, and those findings have just been released in a new report, Profitable Outcomes Linked to Data-Driven Maturity. You can also check out our deep dive into the world of DataOps here.

To learn more about how the need for speed and the need for data is driving the future of Formula One, Mark’s thoughts on artificial intelligence and autonomous vehicles, and how one driver was so thirsty for data he was checking the screens around the track while driving 200 MPH during a race, watch the full conversation.

Leveraging Data to Deliver a Transcendent Customer Experience

Customer satisfaction can make or break your business. So, are you collecting and using relevant data to drive meaningful change for your customers’ interactions with your business? We wanted to find out how companies are using—and maximizing—their data to yield value, so we commissioned 451 Research, part of S&P Global Market Intelligence, to survey 1,100 IT and data professionals from diverse global regions. Those findings have just been released in a new report, Profitable Outcomes Linked to Data-Driven Maturity.

Supporting the customer experience is becoming a key focus in the contemporary use of enterprise data, and strong data practices are integral to delivering a Transcendent Customer Experience, one of the tenets of the Autonomous Digital Enterprise, that meets customers where, when, and how they want to be met, providing customer engagement and satisfaction that lead to long-term business profitability.

Fifty-five percent of survey respondents are focused on improving their customer satisfaction levels through the effective use of data. In an increasingly, pervasively online world, the report asserts that failing to capture and understand the context of data derived from customer interactions via digital channels and data-driven mediums “is akin to leaving money on the table.” Over the next 24 months, one-fifth of survey respondents expect customer satisfaction to be the single area of most significant improvement in their data strategy evolution.

The types of data gathered from and about customers can be used to inform and influence different aspects of their overall experience, empowering businesses to:

  • Personalize offerings tailored to specific customer profiles, preferences, and previous purchases
  • Identify and correct service issues through customer surveys and self-service solutions
  • Forecast and respond to trends, adjusting supply chains to meet customer demand
  • Incentivize customers with loyalty and rewards programs based on engagement and purchases

Seventy percent of those surveyed said they were highly effective or mostly effective at leveraging data-driven insights for customer-facing processes such as onboarding and signups, while 74 percent said they were highly effective or mostly effective using those insights to help ensure customer service (finding products, placing orders, providing delivery status, etc.).

To make data useful to the business, organizations must be able to have a unified view of their data, as well as automated tools and processes to better manage and organize it; verify its quality; analyze its usefulness; and ensure that it flows to the right place at the right time for faster decision making. The right mix of people, processes, and technology is essential to ensure a democratized data culture and develop true data maturity.

To do this, organizations must take a holistic, enterprise-wide view of their data assets and activity, implementing a DataOps methodology that applies agile and automated approaches to data management to support data-driven business outcomes and leverages appropriate supporting technology to optimize business processes and people. DataOps represents a culture and technology shift. Among organizations with a more mature DataOps strategy, 77 percent indicated that their organization’s use of data has had a most significant impact to date on customer satisfaction, versus 65 percent among total respondents.

To learn more about how DataOps and data maturity can help organizations deliver a Transcendent Customer Experience and tangible benefits of data maturity across the business, visit bmc.com/valueofdata.

Diving Deep into All Things Data with Dr. Tricia Wang

We know we’re living in an increasingly data-driven world, but have you ever wondered what all that data says about us? Dr. Tricia Wang has, and BMC CTO Ram Chakravarti and AWS Senior Partner Development Manager Vijaya Balakrishna were honored to recently host a new Transformational Speaker webinar with the global tech ethnographer, researcher, and popular TED Talk speaker to get her take on “thick data”—the human element invisible to quantitative data analysis. Here are some highlights of the conversation.

Looking beyond the quantitative

Dr. Wang kicked off the discussion explaining that what’s most interesting to her is going beyond the basic definition that most people think of when they use the term “data.” “We all know what big data is. It’s numbers that are put in spreadsheets that are put in data lakes, data warehouses, all that stuff…that you can then do math on. But the world is not just numbers. There are other ways to represent the world and to represent the…processes that we’re interested in, especially that businesses are interested in,” she explains.

“If you think about in your everyday life, you don’t make decisions just based on numbers. We make decisions based on a holistic picture, and so numbers are taken into account, [but] you’re also looking at non-quantitative indicators.” She points to people’s reliance on smart watches and Fitbits as monitoring tools while still paying attention to general indicators such as how they actually feel.

She explains that it’s been her life’s work educating business leaders who want to stay ahead on how to yield value from non-quantitative data and leverage it as an indicator to drive actionable insights and make their businesses more agile and customer centric. And she says that while automation is gaining prominence, it’s not the only tool that you should have in your toolbox.

“Any qualified, legit business leader is fully aware that you need to have domain expertise and that domain expertise holds so much thick data,” she says. “There’s so much thick data that’s required to manage relationships…to read[ing] a room [and] knowing how to present a story to get buy in. The higher you go up, you’re really just…trying to convince people to see your point of view and get to some aligned outcome. And that requires thick data. You can’t just convince people based on quantitative numbers.”

“You need to have a human in the loop and really understand the human model, or the landscape that you are trying to understand, [or] the business outcomes. And then you can look at the variables that are going to help you understand that human model [and] build your data model based on the human model.” She adds that she tells her CEOs to stop treating their CIOs and CTOs like technical partners and help them understand their role as cultural change agents for the entire business.

Data in a brave new world

Dr. Wang turned the discussion toward the pandemic, highlighting that it’s had a significant impact not just on people in their personal and professional lives, but businesses, too. “Business ethnography now is about observing customers, their needs and motivations, in their own environment, in their own cultural setting. The whole world has changed and it’s still changing. We live in an incredible time where we talk about digital transformation and businesses leading digital transformation. The reality is front facing digital transformation was not what…it wasn’t really until the pandemic really forced the whole world online,” she says.

“One of the biggest changes is what I call the spatial collapse, where before, for centuries, since the Industrial Revolution, we lived in a way that has been more or less separated by three spaces. Home is the first space, the second space is the workspace, and then the third space is everything you do outside of that, from having fun, seeing strangers, meeting with your friends. And what the pandemic did was collapse all the three spaces into one. That is one of the biggest trends for changing customer behavior and the psychology of it.”

“The mental models of how they build their lives has radically changed. And it doesn’t just mean it’s customer behavioral change. It means employees change. The way we recruit talent changes. People have different expectations when there’s a spatial collapse.” She points out that’s why so many employees bristled at return to office mandates.

A new (siloed) world view

Dr. Wang also points out that in addition to the pandemic, sociopolitical and geopolitical events like the global supply chain shortage and the war in Ukraine have all had ripple effects from “how someone is staying warm over the night, and how they’re getting food” because we are in “a very interconnected world.” And while that connectedness holds true for the real world, people are shifting more toward silos online, which in turn is disrupting how digital business finds its customers.

“We’re going to see more polarization. I’m not anti-social media, but I do think [its] algorithms have been optimized for moving people into polarization and two extreme ends. Having safe spaces for conversation becomes much harder…which means that a lot of people are moving off social media, off of public spaces, into smaller walled gardens like private conversation groups like WhatsApp or Signal,” she explains.

“What this means for businesses, in this era, [is] that you have to do a better job on getting to know your customer. First-party data becomes even more valuable. However, customers or…people are much more aware of the value of their personal data to a [billion-dollar] market, and people want a piece of that. [It] requires trusting relationships [and] the business has to change the way they collect data and communicate data…and show transparently what [they’re] doing with that data.”

A shifting focus

Another effect of the pandemic is that people moved to new cities and towns—and reexamined how they live, elevating larger discussions around environmental, social, and governance (ESG); corporate social responsibility (CSR); and sustainability. People now want much more transparency from businesses about how they are going to influence their lives, contribute to the world overall, and make it a better place—and they’re redefining what “better” means.

“One of the things we’re seeing is the change in supply chain strategy, because people are saying, ‘more local.’ (…) They have a new kind of relationship with those around them, the city or the space around them, and their shops are connected to their local economies in different ways. There’s a lot happening around regionalization of supply chains to keep things more stable. People are much more aware of a disruptive supply chain and really thinking through, ‘Where does my food come from?’ ‘What does my clothing come from?’ ‘What kind of stores or businesses am I supporting?’ Big topic issues that maybe weren’t as top of mind, like the climate and food security, will become even more important.”

As Dr. Wang shared, the concept of data is far-reaching, and automation can help you get a handle on it and yield its maximum value. To learn more about automating your entire big data lifecycle from end to end—and cloud to cloud—to deliver insights more quickly, easily, and reliably, download our e-book.

Orchestrate and Automate to Make DataOps Successful

DataOps is intended to smooth the path to becoming a data-driven enterprise, but some roadblocks remain. This year, according to a new IDC InfoBrief sponsored by BMC, DataOps professionals reported that on average, only 58 percent of the data they need to support analytics and decision making is available. How much better would decision-making be, and how much business value would be created, if the other 42 percent of the data could be factored into decisions as intended? It seems logical to assume it would be almost twice as good!

That raises another question: Why can’t organizations get the data that they already have where they need it, when they need it? In most cases, the answer comes down to complexity.

A previous blog by my colleague, Basil Faruqui, introduced why DataOps is important. This one follows up to highlight what is needed. Spoiler alert: The ability to orchestrate multiple data inputs and outputs is a key requirement.

The need to manage data isn’t new, but the challenges of managing data to meet business needs are changing very fast. Organizations now rely on more data sources than ever before, along with the technology infrastructure to acquire, process, analyze, communicate, and store the data. The complexity of creating, managing, and quality-assuring a single workload increases exponentially as more data sources, data consumers (both applications and people), and destinations (cloud, on-premises, mobile devices, and other endpoints, etc.) are included.

DataOps is helping manage these pathways, but is also proving to have some limitations. The IDC InfoBrief found integration complexity is the leading obstacle to operationalizing and scaling DataOps and data pipeline orchestration. Other obstacles include a lack of internal skills and time to solve data orchestration challenges, and difficulty using the available tooling. That means that for complex workloads like these, organizations can’t fully automate the planning, scheduling, execution, and monitoring because the complexity causes gaps, which in turn cause delays. This results in decisions being made based on incomplete or stale data, thus limiting business value and hampering efforts to become a data-driven enterprise.

Complexity is a big problem. It is also a solvable one. Orchestration, and more specifically, automating orchestration, are essential to reducing complexity and enabling scalability, unlike scripting and other workarounds. Visibility into processes, self-healing capabilities, and user-friendly tools also make complexity manageable. As IDC notes in its InfoBrief, “Using a consistent orchestration platform across applications, analytics, and data pipelines speeds end-to-end business process execution and improves time to completion.”

Some of the most important functionality that is needed to achieve orchestration includes:

  • Built-in connectors and/or integration support for a wide range of data sources and environments
  • Support for an as-code approach so automation can be embedded into the deployment pipelines (a minimal sketch follows this list)
  • Complete workflow visibility across a highly diverse technology stack
  • Native ability to identify problems and remediate them when things go wrong
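
To make the as-code and self-remediation items above concrete, here is a minimal, generic Python sketch of a pipeline step that validates its data and raises an alert before anything flows downstream. It is an illustration of the pattern only, not Control-M syntax, and the webhook URL is a hypothetical placeholder; in Control-M the equivalent behavior is declared through its jobs-as-code interface and built-in integrations.

```python
import json
from urllib.request import Request, urlopen

WEBHOOK_URL = "https://example.com/hooks/data-alerts"  # hypothetical alerting endpoint


def notify(message: str) -> None:
    """Stand-in for a built-in notification: post an alert to a webhook."""
    req = Request(
        WEBHOOK_URL,
        data=json.dumps({"text": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urlopen(req)


def quality_check(rows: list) -> bool:
    """A trivial quality gate: data exists and every row carries an 'id' field."""
    return bool(rows) and all("id" in row for row in rows)


def run_step(name: str, extract, load) -> None:
    """Run one pipeline step as code: extract, validate, then hand off downstream."""
    rows = extract()
    if not quality_check(rows):
        notify(f"{name}: quality check failed; holding downstream workflows")
        raise RuntimeError(f"{name}: bad or missing data")
    load(rows)
```

Because the workflow logic is expressed as code, the same definition can be checked into version control and deployed through the delivery pipeline alongside the application it serves, which is exactly what the as-code requirement is meant to enable.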

Tooling that is specific to a software product, development environment, or hyperscale platform may provide some of that functionality, but typically isn’t comprehensive enough to cover all the systems and sources the workflow will touch. That’s one reason so many DataOps professionals report that tooling complexity hinders their efforts.

Control-M can simplify DataOps because it works across and automates all elements of the data pipeline, including extract, transform, load (ETL), file transfer, and downstream workflows. Control-M is also a great asset for DataOps orchestration because:

  • It eliminates the need to use multiple file transfer systems and schedulers.
  • It automatically manages dependencies across sources and systems and provides automatic quality checks and notifications, which prevents delays from turning into major logjams and job failures further downstream.

Here are a couple of quotes from Control-M users that illustrate its value. A professional at a healthcare company said, “Control-M has also helped to make it easier to create, integrate, and automate data pipelines across on-premises and cloud technologies. It’s due to the ability to orchestrate between workflows that are running in the cloud and workflows that are running on-prem. It gives us the ability to have end-to-end workflows, no matter where they’re running.”

Another user, Railinc, said, “The order in which we bring in data and integrate it is key. If we had to orchestrate the interdependencies without a tool like Control-M, we would have to do a lot of custom work, a lot of managing. Control-M makes sure that the applications have all the data they need.” You can see the full case study here.

These customers are among the many organizations that have reduced the complexity of their DataOps through automation. The IDC InfoBrief compares enterprises that excel at DataOps orchestration to those that don’t and found advantages for the leaders in multiple areas, including compliance, faster decision-making and time-to-innovation, cost savings, and more.

Can you orchestrate similar results at your organization? Learn more about Control-M for Data Pipeline Orchestration here and register for a free trial.

 

Improve SAP® system performance with automated data archiving

Remember when we used to talk about big data? Volume, variety, veracity, and velocity—the metrics were mind boggling. Today we are actually living in that reality. Businesses are generating, collecting, and trying to manage and analyze more data than we ever thought possible. Sales data, check. Market data, check. Internet of Things (IoT) data, systems of record, social media data…check, check, check.

Data is everywhere, and it’s only growing (in size and importance). So much, in fact, that the term “big data” is basically dead. Now it’s just data. Every company is focused on turning all this data into insights. And that’s great; it’s pushing the boundaries of what’s possible. But what do you do when all that data starts aging? Many companies are struggling because they still haven’t developed effective data archiving strategies.

This can be especially true for companies with large SAP® installations. Why? Because, over time, SAP implementations generate tons of data. Often, the data is sitting on production instances, slowing jobs, processes, and development cycles. Explosive data growth in SAP systems causes deterioration of application performance and user productivity, and it generates higher costs due to large tier-1 enterprise storage volumes (with redundancies) and higher administration costs due to long backup/maintenance windows and SAP upgrades.

So, why don’t companies just archive all this data? The simple answer: because it’s not that easy. Here are a few common challenges:

  • Consensus: Stakeholders across the organization often can’t agree on retention policies. Data archiving affects many groups, including IT operations (ITOps), business users, functional and technical SAP teams, and legal and compliance teams, etc.
  • Rules and regulations: There are a lot of rules to consider, including audit guidelines, industry-specific Food and Drug Administration (FDA) regulations, and many more.
  • Future availability: And we can’t forget the fear factor. Everyone asks, “What happens when we need this data and it’s no longer available?”

SAP recommends that companies archive data regularly. But it’s critical to take a comprehensive, automated approach. Built-in and home-grown archiving tools are often limited in scope. For example, SAP includes some of its Information Lifecycle Management (ILM) functionality in its standard NetWeaver technology platform, but a separate ILM license must be purchased to use it for other types of data.

Meanwhile, enterprises already have solutions in place for archiving and file transfers for their non-SAP data, so the separate ILM license could be considered a redundant cost. Many legacy solutions have similar limitations on the data types they can handle and the applications, storage, and other infrastructure components they can work with, which has made multi-tool environments common.

Beyond that, the impending SAP ERP Central Component (ECC) end-of-life and SAP S/4HANA® becoming the favored form of SAP demand a modern approach to data archiving. Fortunately, Control-M can help. Control-M is an SAP-certified solution that creates and manages data archiving jobs for SAP ECC, SAP S/4HANA, and SAP Business Warehouse, and can support any application in the SAP ecosystem. This reduces the time, complexity, and specialized knowledge required. It can also be used for all other enterprise jobs, services, processes, and workflows. That lets organizations using SAP build, orchestrate, run, and manage all their enterprise jobs from a consolidated, integrated platform that provides visibility across all enterprise workflows and their dependencies. The result?

  • SAP system performance and response times improve, which reduces hardware and administrative costs.
  • System availability increases, resulting in less downtime during release upgrades.
  • Employees across the organization get better access to data and documents.
  • Archived data is compressed and stored in archive files, which remain accessible through application reports or the SAP Archive Information System.
  • Companies are better positioned against security threats and for audit and compliance requirements.
  • Data archiving helps organizations accelerate their journey to SAP S/4HANA.

Want to learn more about how Control-M can help your organization streamline data archiving processes? Check out this white paper.

SAP, SAP S/4HANA are the trademark(s) or registered trademark(s) of SAP SE or its affiliates in Germany and in several other countries.

How to orchestrate a data pipeline on Google Cloud with Control-M from BMC https://www.bmc.com/blogs/orchestrate-a-data-pipeline/ Thu, 22 Sep 2022 16:17:22 +0000

The Google Cloud Platform is designed specifically to accommodate organizations in a variety of positions along their cloud services journey, from large-scale machine learning (ML) and data analysis, to services tailored to small and midsize businesses (SMBs), to hybrid-cloud solutions for customers that want to use services from more than one cloud provider. When BMC was migrating our Control-M application to this cloud ecosystem, we had to be very thoughtful about how we managed this change. The SADA engineering team worked alongside the BMC team to ensure that we had a seamless integration for our customers.

SADA supported this project by providing an inventory of the Google Cloud configuration options, decisions, and recommendations to enable the data platform foundation deployment; collaborating with BMC on implementation planning; providing automation templates; and designing the Google Cloud architecture for the relevant managed services on the Google Cloud Platform.

In this article, we will discuss the end result of this work and look at an example that uses a credit card fraud detection process to show how you can use Control-M to orchestrate a data pipeline seamlessly in Google Cloud.

Five orchestration challenges

There are five primary challenges to consider when streamlining the orchestration of an ML data pipeline:

  • Understand the workflow. Examine all dependencies and any decision trees. For example, if data ingestion is successful, then proceed down this path; if it is not successful, proceed down that path.
  • Understand the teams. If multiple teams are involved in the workflow, each needs to have a way to define their workflow using a standard interface, and to be able to merge their workflows to make up the pipeline.
  • Follow standards. Teams should use repeatable standards and conventions when building workflows. This avoids having multiple jobs with identical names. Each step should also have a meaningful description to help clarify its purpose in the event of a failure.
  • Minimize the number of tools required. Use a single tool for visualization and interaction with the pipeline (and dependencies). Visualization is important during the definition stage since it’s hard to manage something that you can’t see. This is even more important when the pipeline is running.
  • Include built-in error handling capabilities in the orchestration engine. It’s important to understand how errors can impact downstream jobs in the workflow or the business service level agreement (SLA). At the same time, the failure of a single job should not halt the entire pipeline or always require human intervention. Criteria can determine whether a failed job is restarted automatically or whether a person must evaluate the failure, for instance when the same error has occurred a certain number of times (see the sketch after this list).
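
To make this concrete, here is a minimal sketch of how a restart-or-notify policy might look in Control-M’s JSON job format, following the If/Mail pattern used in the Defaults example later in this article. The job name and recipient address mirror that example, while the Action:Rerun action type is an assumption for illustration; confirm the exact action names in the Control-M Automation API guide.

"jog-dflow-gcs-to-bq-fraud": {"Type": "Job:Google DataFlow", …,
    "actionIfError": {"Type": "If", "CompletionStatus": "NOTOK",
        "rerunIt": {"Type": "Action:Rerun"},
        "mailTeam": {"Type": "Mail",
            "Message": "Job %%JOBNAME failed and was resubmitted",
            "To": "deng_support@bmc.com"}}
    },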

Meeting the challenge

Meeting these orchestration challenges required a solid foundation and also presented opportunities for collaboration. BMC and SADA aligned using the SADA POWER line of services to establish the data platform foundation. Some notable elements in this technical alignment included work by SADA to:

  • Apply industry expertise to expedite BMC’s development efforts.
  • Establish a best practices baseline around data pipelines and the tools to orchestrate them.
  • Conduct collaborative sessions to understand BMC’s technical needs and provide solutions that the BMC team could integrate and then expand upon.

SADA’s Data Platform Foundation provided opportunities to leverage Google Cloud services to accomplish the complex analytics required of an application like Control-M. The BMC and SADA teams worked together to establish a strong foundation for a robust and resilient solution through:

  • Selecting data and storage locations in Google Cloud Storage.
  • Utilizing the advantages provided by Pub/Sub to streamline the analytics and data integration pipelines.
  • Having thorough discussions around the extract, transform, and load (ETL) processes to truly understand the end state of the data.
  • Using BigQuery and writing analytic queries.
  • Understanding the importance of automation, replicability of processes, and monitoring performance in establishing a system that is scalable and flexible.
  • Using Data Studio to create a visualization dashboard to provide the necessary business insights.

Real-world example

Digital transactions have been increasing steadily for many years, but that trend is now coupled with a permanent decline in the use of cash as people and businesses practice physical distancing. The adoption of digital payments for businesses and consumers has consequently grown at a much higher rate than previously anticipated, leading to increased fraud and operational risks.

With fraudsters improving their techniques, companies are relying on ML to build resilient and efficient fraud detection systems.

Since fraud constantly evolves, detection systems must be able to identify new types of fraud by detecting anomalies that are seen for the first time. Therefore, detecting fraud is a perpetual task that requires constant diligence and innovation.

Common types of financial fraud that customers work to prevent with this application include:

  • Stolen/fake credit card fraud: Transactions made using fake cards, or cards belonging to someone else.
  • ATM fraud: Cash withdrawals using someone else’s card.

Fraud detection is composed of both real-time and batch processes. The real-time process is responsible for denying a transaction and possibly placing a hold on an account or credit card, thus preventing the fraud from occurring. It must respond quickly, sometimes at the cost of reduced accuracy.

To minimize false positives, which may upset or inconvenience customers, a batch phase is used to continuously fine-tune the detection model. After transactions are confirmed as valid or fraudulent, all recent events are input to the batch process on a regular cadence. This batch process then updates the training and scoring of the real-time model to keep real-time detection operating at peak accuracy. This batch process is the focus of this article.

Use our demo system

SADA and BMC created a demonstration version of our solution so you can experiment with it on Google Cloud. You can find all of our code, plus sample data, in GitHub.

Resources included are:

  • Kaggle datasets of transaction data, fraud status, and demographics
  • Queries
  • Schema
  • User-defined functions (UDFs)

How it works

For each region in which the organization operates, transaction data is collected daily. Details collected include (but are not limited to):

  • Transaction details. Describes each transaction, including the amount, item code, location, method of payment, and so on.
  • Personal details. Describes the name, address, age, and other details about the purchaser.

This information is pulled from corporate data keyed on credit card information, together with the output of the real-time fraud detection system, which identifies the transactions that were flagged as fraudulent.

New data either arrives as batch feeds or is dropped into Cloud Storage by Pub/Sub. This new data is then loaded into BigQuery by Dataflow jobs. Normalization and some data enrichment are performed by UDFs during the load process.

Once all the data preparation is complete, analytics are run against the combined new and historical data to test and rank fraud detection performance. The results are displayed in Data Studio dashboards.

Figure 1: Control-M orchestration

Google Cloud services in the pipeline

Cloud Storage provides a common landing zone for all incoming data and a consistent input for downstream processing. Dataflow is Google Cloud’s primary data integration tool.

SADA and BMC selected BigQuery for data processing. Earlier versions of this application used Hadoop, but while working with the team at SADA, we converted to BigQuery, which is Google’s recommended approach for sophisticated data warehouse and data lake applications. This choice also simplified setup by providing out-of-the-box integration with Cloud Dataflow. UDFs provide a simple mechanism for manipulating data during the load process.

Two ways to define pipeline workflows

You can use Control-M to define your workflow in two ways:

  • Using a graphical editor. This provides the option of dragging and dropping the workflow steps into a workspace and connecting them.
  • Using RESTful APIs. Define the workflows using a jobs-as-code method in JSON, then integrate them into a continuous integration/continuous delivery (CI/CD) toolchain. This method improves workflow management by flowing jobs through a pipeline of automated building, testing, and release. Google Cloud provides a number of developer tools for CI/CD, including Cloud Build and Cloud Deploy.

Defining jobs in the pipeline

The basic Control-M execution unit is referred to as a job. There are a number of attributes for every job, defined in JSON:

  • Job type. Options include script, command, file transfer, Dataflow, or BigQuery.
  • Run location. For instance, which host is running the job?
  • Identity. For example, is the job being “run as…” or run using a connection profile?
  • Schedule. Determines when to run the job and identifies relevant scheduling criteria.
  • Dependencies. This could be things like whether the job must finish by a certain time or output must arrive by a certain time or date.

Jobs are stored in folders; attributes and other instructions defined at the folder level apply to all jobs in that folder.

The code sample below shows the JSON that describes the workflow used in the fraud detection model ranking application. You can find the full JSON, along with other solutions, the Control-M Automation API guide, and additional code samples, in the Control-M Automation API Community Solutions GitHub repo.

{
"Defaults" : {
},
"jog-mc-gcp-fraud-detection": {"Type": "Folder",
  "Comment" : "Update fraud history, run, train and score models",
  "jog-gcs-download" : {"Type" : "Job:FileTransfer",…},
  "jog-dflow-gcs-to-bq-fraud": {"Type": "Job:Google DataFlow",…},
  "jog-dflow-gcs-to-bq-transactions": {"Type": "Job:Google DataFlow",…},
  "jog-dflow-gcs-to-bq-personal": {"Type": "Job:Google DataFlow",…},
  "jog-mc-bq-query": {"Type": "Job:Database:EmbeddedQuery", …},
  "jog-mc-fm-service": {"Type": "Job:SLAManagement",…},
  "flow00": {"Type":"Flow", "Sequence":[
    "jog-gcs-download",
    "jog-dflow-gcs-to-bq-fraud",
    "jog-mc-bq-query",
    "jog-mc-fm-service"]},
  "flow01": {"Type":"Flow", "Sequence":[
    "jog-gcs-download",
    "jog-dflow-gcs-to-bq-transactions",
    "jog-mc-bq-query", "jog-mc-fm-service"]},
  "flow02": {"Type":"Flow", "Sequence":[
    "jog-gcs-download",
    "jog-dflow-gcs-to-bq-personal",
    "jog-mc-bq-query",
    "jog-mc-fm-service"]}
}
}

The jobs shown in this workflow correspond directly with the steps illustrated previously in Figure 1.

The workflow contains three fundamental sections:

  • Defaults. These are settings that apply to the whole workflow. They can include details such as who to contact for job failures or standards for job naming and structure.
{  "Defaults" : {"RunAs" : "ctmagent", "OrderMethod": "Manual", "Application" : 
       "multicloud", "SubApplication" : "jog-mc-fraud-modeling", 
      "Job" : {"SemQR": { "Type": "Resource:Semaphore", Quantity": "1"},
      "actionIfError" : {"Type": "If", "CompletionStatus":"NOTOK", "mailTeam": 
          {"Type": "Mail", "Message": "Job %%JOBNAME failed", "Subject": 
                 "Error occurred", "To": deng_support@bmc.com}}}
    }, 

  • Job definitions. This is where individual jobs are specified and listed. See below for descriptions of each job in the flow.
  • Flow statements. These define the relationships between jobs, both upstream and downstream.
"flow00": {"Type":"Flow", "Sequence":["jog-gcs-download", 
           "jog-dflow-gcs-to-bq-fraud", "jog-mc-bq-query", 
           "jog-mc-fm-service"]},
"flow01": {"Type":"Flow", "Sequence":["jog-gcs-download", 
           "jog-dflow-gcs-to-bq-transactions", 
           "jog-mc-bq-query", "jog-mc-fm-service"]},
"flow02": {"Type":"Flow", "Sequence":["jog-gcs-download", 
           "jog-dflow-gcs-to-bq-personal", "jog-mc-bq-query", 
           "jog-mc-fm-service"]} 

Scheduling pipeline workflows

Control-M uses a server-and-agent model. The server is the central engine that manages workflow scheduling and submission to agents, which are lightweight workers. In the demo described in this article, the Control-M server and agent are both running on Google Compute Engine VM instances.

Workflows are most commonly launched in response to events such as data arrival, but they may also be executed automatically on a predefined schedule. Schedules are very flexible: they can refer to business calendars; specify different days of the week, month, or quarter; define cyclic execution, which runs workflows intermittently or every "n" hours or minutes; and so on.
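
For instance, a simple time-based schedule can be expressed directly in the folder or job JSON with a When object. The sketch below is illustrative only; the weekday and time-window values are placeholders, and the full set of scheduling attributes is documented in the Control-M Automation API guide.

"jog-mc-gcp-fraud-detection": {"Type": "Folder", …,
    "When": {
        "WeekDays": ["MON", "TUE", "WED", "THU", "FRI"],
        "FromTime": "0200",
        "ToTime": "0600"
    }
},

An event-driven launch, such as triggering on data arrival, would replace or complement this block; the file-watching transfer job described below is an example of that pattern.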

Processing the data

File Transfer job type

Looking at the first job, jog-gcs-download (shown in the code sample below), we can see that this job, of the type Job:FileTransfer, transfers files from a conventional file system described by ConnectionProfileSrc to Google Cloud Storage described by ConnectionProfileDest.

The File Transfer job type can watch for data-related events (file watching) as a prerequisite for data transfer, as well as perform pre/post actions such as deletion of the source after a successful transfer, renaming, source and destination comparison, and restart from the point of failure in the event of an interruption. In the example, this job moves several files from a Linux® host and drops them into Google Cloud Storage buckets.

"jog-gcs-download" : {"Type" : "Job:FileTransfer",
        "Host" : "ftpagents",
        "ConnectionProfileSrc" : "smprodMFT",
        "ConnectionProfileDest" : "joggcp",
        "S3BucketName" : "prj1968-bmc-data-platform-foundation",
        "Description" : "First data ingest that triggers downstream applications",
        "FileTransfers" : [
          {
            "TransferType" : "Binary",
            "TransferOption" : "SrcToDestFileWatcher",
            "Src" : "/bmc_personal_details.csv",
            "Dest" : "/bmc_personal_details.csv"
          },
          {
            "TransferType" : "Binary",
            "TransferOption" : "SrcToDestFileWatcher",
            "Src" : "/bmc_fraud_details.csv",
            "Dest" : "/bmc_fraud_details.csv"
          },
          {
            "TransferType" : "Binary",
            "TransferOption" : "SrcToDestFileWatcher",
            "Src" : "/bmc_transaction_details.csv",
            "Dest" : "/bmc_transaction_details.csv"
          } 
        ]
      }, 

Dataflow

Dataflow jobs are executed to push the newly arrived data into BigQuery. The job definitions look complex, but Google Cloud provides an easy way to generate them.

Go to the Dataflow Jobs page (Figure 2). If you have an existing job, choose to Clone it or Create Job from Template. Once you’ve provided the desired parameters, click on Equivalent REST at the bottom to get this information (Figure 3), which you can cut and paste directly into the job’s Parameters section.

Figure 2: Dataflow Jobs page

Figure 3: Cut and paste into job Parameters section

"jog-dflow-gcs-to-bq-fraud": {"Type": "Job:ApplicationIntegrator:AI Google DataFlow",
        "AI-Location": "us-central1",
        "AI-Parameters (JSON Format)": "{\"jobName\": \"jog-dflow-gcs-to-bq-fraud\",
        \"environment\": {        \"bypassTempDirValidation\": false,
        \"tempLocation\": \"gs://prj1968-bmc-data-platform-foundation/bmc_fraud_details/temp\",
        \"ipConfiguration\": \"WORKER_IP_UNSPECIFIED\",
        \"additionalExperiments\": []    },    
        \"parameters\": {
        \"javascriptTextTransformGcsPath\": \"gs://prj1968-bmc-data-platform-foundation/bmc_fraud_details/bmc_fraud_details_transform.js\", 
        \"JSONPath\": \"gs://prj1968-bmc-data-platform-foundation/bmc_fraud_details/bmc_fraud_details_schema.json\",
        \"javascriptTextTransformFunctionName\": \"transform\",
        \"outputTable\": \"sso-gcp-dba-ctm4-pub-cc10274:bmc_dataplatform_foundation.bmc_fraud_details_V2\",
        \"inputFilePattern\": \"gs://prj1968-bmc-data-platform-foundation/bmc_fraud_details/bmc_fraud_details.csv\", 
        \"bigQueryLoadingTemporaryDirectory\": \"gs://prj1968-bmc-data-platform-foundation/bmc_fraud_details/tmpbq\"    }}",
        "AI-Log Level": "INFO",
        "AI-Template Location (gs://)": "gs://dataflow-templates-us-central1/latest/GCS_Text_to_BigQuery",
        "AI-Project ID": "sso-gcp-dba-ctm4-pub-cc10274",
        "AI-Template Type": "Classic Template",
        "ConnectionProfile": "JOG-DFLOW-MIDENTITY",
        "Host": "gcpagents"
      }, 
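
BigQuery query

The analytics step in the workflow, jog-mc-bq-query, runs as a database job with an embedded query against BigQuery (its type, Job:Database:EmbeddedQuery, appears in the workflow JSON above). Below is a minimal, hypothetical sketch of such a definition: the connection profile name, host, query text, and table name are placeholders, and the exact attribute names for embedded-query jobs should be confirmed in the Control-M Automation API code reference.

"jog-mc-bq-query": {"Type": "Job:Database:EmbeddedQuery",
        "ConnectionProfile": "JOG-BQ-CONN",
        "Host": "gcpagents",
        "Query": "SELECT model_id, score FROM bmc_dataplatform_foundation.model_rankings ORDER BY score DESC",
        "Description": "Test and rank fraud detection models against combined new and historical data"
      },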

SLA management

This job defines the SLA completion criteria and instructs Control-M to monitor the entire workflow as a single business entity.

"jog-mc-fm-service": {"Type": "Job:SLAManagement",
	 "ServiceName": "Model testing and scoring for fraud detection",
	 "ServicePriority": "3",
	 "JobRunsDeviationsTolerance": "3",
	 "CompleteIn": {
	    "Time": "20:00"
	  }
	},

The ServiceName specifies a business-relevant name that appears in notifications, service incidents, and displays for non-technical users, making it clear which business service may be impacted. It is important to note that Control-M uses statistics collected from previous executions to automatically compute the expected completion time, so any deviation can be detected and reported at the earliest possible moment. This gives monitoring teams the maximum opportunity to course-correct before business services are affected.

Examining the state of the pipeline

Now that you have an idea of how jobs are defined, let’s take a look at what the pipeline looks like when it’s running.

Control-M provides a user interface for monitoring workflows (Figure 4). In the screenshot below, the first job has completed successfully and is shown in green; the next three jobs are executing and are shown in yellow. Jobs that are waiting to run are shown in gray.

Figure 4: Control-M Monitoring Domain

You can access the output and logs of every job from the pane on the right-hand side. This capability is vital during daily operations. To monitor those operations more easily, Control-M provides a single pane to view the output of jobs running on disparate systems without having to connect to each application’s console.

Control-M also allows you to perform several actions on the jobs in the pipeline, such as hold, rerun, and kill. You sometimes need to perform these actions when troubleshooting a failure or skipping a job, for example.

All of the functions discussed here are also available from a REST-based API or a CLI.

Conclusion

In spite of the rich set of ML tools that Google Cloud provides, coordinating and monitoring workflows across an ML pipeline remains a complex task.

Anytime you need to orchestrate a business process that combines file transfers, applications, data sources, or infrastructure, Control-M can simplify your workflow orchestration. It integrates, automates, and orchestrates application workflows whether on-premises, on the Google Cloud, or in a hybrid environment.
