Machine Learning & Big Data Blog – BMC Software | Blogs

Streamlining Machine Learning Workflows with Control-M and Amazon SageMaker

In today’s fast-paced digital landscape, the ability to harness the power of artificial intelligence (AI) and machine learning (ML) is crucial for businesses aiming to gain a competitive edge. Amazon SageMaker is a game-changing ML platform that empowers businesses and data scientists to seamlessly navigate the development of complex AI models. One of its standout features is its end-to-end ML pipeline, which streamlines the entire process from data preparation to model deployment. Amazon SageMaker’s integrated Jupyter Notebook platform enables collaborative and interactive model development, while its data labeling service simplifies the often-labor-intensive task of data annotation.

It also boasts an extensive library of pre-built algorithms and deep learning frameworks, making it accessible to both newcomers and experienced ML practitioners. Amazon SageMaker’s managed training and inference capabilities provide the scalability and elasticity needed for real-world AI deployments. Moreover, its automatic model tuning and robust monitoring tools enhance the efficiency and reliability of AI models, ensuring they remain accurate and up-to-date over time. Overall, Amazon SageMaker offers a comprehensive, scalable, and user-friendly ML environment, making it a top choice for organizations looking to leverage the potential of AI.

Bringing Amazon SageMaker and Control-M together

Amazon SageMaker simplifies the entire ML workflow, making it accessible to a broader range of users, including data scientists and developers. It provides a unified platform for building, training, and deploying ML models. However, to truly harness the power of Amazon SageMaker, businesses often require the ability to orchestrate and automate ML workflows and integrate them seamlessly with other business processes. This is where Control-M from BMC comes into play.

Control-M is a versatile application and data workflow orchestration platform that allows organizations to automate, monitor, and manage their data and AI-related processes efficiently. It can seamlessly integrate with SageMaker to create a bridge between AI modeling and deployment and business operations.

In this blog, we’ll explore the seamless integration between Amazon SageMaker and Control-M and the transformative impact it can have on businesses.

Amazon SageMaker empowers data scientists and developers to create, train, and deploy ML models across various environments—on-premises, in the cloud, or on edge devices. An end-to-end data pipeline typically includes more than Amazon SageMaker’s AI and ML functionality: data is ingested from multiple sources, transformed, and aggregated before a model is trained and AI/ML pipelines are executed with Amazon SageMaker. Control-M is often used for automating and orchestrating such end-to-end data pipelines. A good example of end-to-end orchestration is covered in the blog, “Orchestrating a Predictive Maintenance Data Pipeline,” co-authored by Amazon Web Services (AWS) and BMC.

Here, we will specifically focus on integrating Amazon SageMaker with Control-M. When you have Amazon SageMaker jobs embedded in your data pipeline or complex workflow orchestrated by Control-M, you can harness the capabilities of Control-M for Amazon SageMaker to efficiently execute an end-to-end data pipeline that also includes Amazon SageMaker pipelines.

Key capabilities

Control-M for Amazon SageMaker provides:

  • Secure connectivity: Connect to any Amazon SageMaker endpoint securely, eliminating the need to provide authentication details explicitly
  • Unified scheduling: Integrate Amazon SageMaker jobs seamlessly with other Control-M jobs within a single scheduling environment, streamlining your workflow management
  • Pipeline execution: Execute Amazon SageMaker pipelines effortlessly, ensuring that your ML workflows run smoothly
  • Monitoring and SLA management: Keep a close eye on the status, results, and output of Amazon SageMaker jobs within the Control-M Monitoring domain and attach service level agreement (SLA) jobs to your Amazon SageMaker jobs for precise control
  • Advanced capabilities: Leverage all Control-M capabilities, including advanced scheduling criteria, complex dependencies, resource pools, lock resources, and variables to orchestrate your ML workflows effectively
  • Parallel execution: Run up to 50 Amazon SageMaker jobs simultaneously per agent, allowing for efficient job execution at scale

Control-M for Amazon SageMaker compatibility

Before diving into how to set up Control-M for Amazon SageMaker, it’s essential to ensure that your environment meets the compatibility requirements:

  • Control-M/EM: version 9.0.20.200 or higher
  • Control-M/Agent: version 9.0.20.200 or higher
  • Control-M Application Integrator: version 9.0.20.200 or higher
  • Control-M Web: version 9.0.20.200 or higher
  • Control-M Automation API: version 9.0.20.250 or higher

Please ensure you have the required installation files for each prerequisite available.

A real-world example

The Abalone Dataset, sourced from the UCI Machine Learning Repository, has been frequently used in ML examples and tutorials to predict the age of abalones based on various attributes such as size, weight, and gender. The age of abalones is usually determined through a physical examination of their shells, which can be both tedious and intrusive. However, with ML, we can predict the age with considerable accuracy without resorting to physical examinations.

For this exercise, we used the Abalone tutorial provided by AWS. This tutorial efficiently walks users through the stages of data preprocessing, training, and model evaluation using Amazon SageMaker.

After understanding the tutorial’s nuances, we trained the Amazon SageMaker model with the Abalone Dataset, achieving satisfactory accuracy. Further, we created a comprehensive continuous integration and continuous delivery (CI/CD) pipeline that automates model retraining and endpoint updates. This not only streamlined the model deployment process but also ensured that the Amazon SageMaker endpoint for inference was always up-to-date with the latest trained model.

Setting up Control-M for Amazon SageMaker

Now, let’s walk through how to set up Control-M for Amazon SageMaker, which has three main steps:

  1. Creating a connection profile that Control-M will use to connect to the Amazon SageMaker environment
  2. Defining an Amazon SageMaker job in Control-M that will define what we want to run and monitor within Amazon SageMaker
  3. Executing an Amazon SageMaker pipeline with Control-M

Step 1: Create a connection profile

To begin, you need to define a connection profile for Amazon SageMaker, which contains the necessary parameters for authentication and communication with SageMaker. Two authentication methods are commonly used, depending on your setup.

Example 1: Authentication with AWS access key and secret

Figure 1. Authentication with AWS access key and secret.

Example 2: Authentication with AWS IAM role from EC2 instance

Figure 2. Authentication with AWS IAM role.

Choose the authentication method that aligns with your environment. It is important to specify the Amazon SageMaker job type exactly as shown in the examples above. Please note that Amazon SageMaker is case-sensitive, so make sure to use the correct capitalization.
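
For reference, the two methods map onto the standard AWS credential options. The hedged Python sketch below (using boto3, outside of Control-M) illustrates the difference; the key values and region shown are placeholders, and in practice Control-M keeps these details in the connection profile so that job definitions never contain credentials.

```python
import boto3

# Option 1 (conceptual): explicit access key and secret, the same details the
# access-key connection profile stores. The values below are placeholders.
key_session = boto3.Session(
    aws_access_key_id="AKIAEXAMPLEKEYID",           # placeholder
    aws_secret_access_key="exampleSecretAccessKey",  # placeholder
    region_name="us-east-1",                         # illustrative region
)

# Option 2 (conceptual): no credentials supplied. boto3 falls back to the IAM
# role attached to the EC2 instance, which is what the IAM-role connection
# profile relies on when the Control-M/Agent runs on EC2.
role_session = boto3.Session(region_name="us-east-1")

# Either session can talk to SageMaker once the credentials resolve.
sagemaker = role_session.client("sagemaker")
print([p["PipelineName"] for p in sagemaker.list_pipelines()["PipelineSummaries"]])
```

Whichever option you choose, the credentials live in the connection profile rather than in individual job definitions, which is what allows Control-M jobs to connect securely without authentication details being provided explicitly.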

Step 2: Define an Amazon SageMaker job

Once you’ve set up the connection profile, you can define an Amazon SageMaker job within Control-M. This job type enables you to execute Amazon SageMaker pipelines effectively.

Figure 3. Example AWS SageMaker job definition.

In this example, we’ve defined an Amazon SageMaker job, specifying the connection profile to be used (“AWS-SAGEMAKER”). You can configure additional parameters such as the pipeline name, idempotency token, parameters to pass to the job, retry settings, and more. For a detailed understanding and code snippets, please refer to the BMC official documentation for Amazon SageMaker.

Step 3: Executing the Amazon SageMaker pipeline with Control-M

It’s essential to note that the pipeline name and endpoint are mandatory JSON objects within the pipeline configuration. Executing the “ctm run” command on the pipeline.json file activates the pipeline’s execution within AWS.

First, we run “ctm build sagemakerjob.json” to validate our JSON configuration and then the “ctm run sagemakerjob.json” command to execute the pipeline.
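
For context, the Control-M job ultimately drives the same SageMaker pipeline API that you could call directly. The boto3 sketch below is a rough, hypothetical equivalent of what happens on the AWS side when the job runs; the pipeline name, parameter, and region values are illustrative stand-ins for whatever the job definition supplies.

```python
import uuid
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")  # illustrative region

# Start the pipeline execution. "AbalonePipeline" and the parameter below are
# hypothetical values standing in for what the Control-M job definition passes.
response = sagemaker.start_pipeline_execution(
    PipelineName="AbalonePipeline",
    PipelineParameters=[
        {"Name": "ModelApprovalStatus", "Value": "PendingManualApproval"},
    ],
    ClientRequestToken=str(uuid.uuid4()),  # idempotency token, as in the job definition
)

# Check the execution status -- roughly the information Control-M surfaces in
# the Monitoring domain while the job runs.
status = sagemaker.describe_pipeline_execution(
    PipelineExecutionArn=response["PipelineExecutionArn"]
)["PipelineExecutionStatus"]
print(status)  # Executing | Succeeded | Failed | Stopping | Stopped
```

In practice, you let Control-M make this call for you so that the execution participates in the broader workflow, SLA management, and monitoring described earlier.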

Figure 4. Launching Amazon SageMaker job.

As seen in the screenshot above, the “ctm run” command has launched the Amazon SageMaker job. The next screenshot shows the pipeline running from the Amazon SageMaker console.

Figure 5. View of data pipeline running in Amazon SageMaker console.

In the Control-M monitoring domain, users have the ability to view job outputs. This allows for easy tracking of pipeline statuses and provides insights for troubleshooting any job failures.

Figure 6. View of Amazon SageMaker job output from Control-M Monitoring domain.

Summary

In this blog, we demonstrated how to integrate Control-M with Amazon SageMaker to unlock the full potential of AWS ML services, orchestrating them effortlessly into your existing application and data workflows. This fusion not only eases the management of ML jobs but also optimizes your overall automation processes.

Stay tuned for more blogs on Control-M and BMC Helix Control-M integrations! To learn more about Control-M integrations, visit our website.

The Essential Role Orchestration Played in Brazilian Retailer Marisa’s Digital Transformation

At Marisa, we are proud of our heritage of being a leading Brazilian retailer for 75 years. Much has changed since our founding, and one of the biggest recent changes happened within our IT systems. Our team was hired by the CIO to lead a major transformation project to break down silos and integrate data from all aspects of the business so we could better connect with our customers.

To get the new business insights we required, we knew we would need new tools. SAP® would remain our system of record, but beyond that we were not committed to keeping many elements of our current-generation system, which included Control-M. Our team had the freedom to completely modernize and select the software and other tools that would best fit our goals and future-proof our business. We used that flexibility to create an infrastructure designed to take advantage of the powerful data, analytics, and reporting capabilities available today.

The result is a hybrid cloud environment with data being used simultaneously for analytics and other functions in multiple locations. Some of the key elements include:

  • SAP Business Warehouse (SAP BW)
  • Informatica
  • Data Lake on Amazon Web Services (AWS)
  • Azure Databricks
  • SQL Server
  • Amazon Redshift
  • Power BI
  • MicroStrategy
  • Airflow
  • Control-M

Yes, Control-M is on the list, despite our initial thought that we would have to replace it because so many workloads and systems were being updated or replaced.

After we identified many of the elements that would be essential in our responsive new architecture, we began to focus on how we could integrate and orchestrate it all. The complexity became frustrating as we learned about the limitations that each component had for integrating its workflows with others. We had counted on Airflow to solve those challenges, but it had its own limitations. That was the point where we realized Control-M was not part of the problem with our IT systems, it was an essential part of the solution.

Our modernization was driven by the principle of bringing together data from more sources, using best-of-breed solutions. We saw that the limitations of domain-specific tools would be a barrier to realizing our vision and getting the most complete insights possible. We then took a closer look and realized that Control-M was capable of doing much more than what we had been using it for. That includes its many integrations with modern data and cloud technologies, so our staff could continue to work with their preferred tools, while allowing Control-M to orchestrate all the operations.

The daily executive report we produce is an excellent example of how everything comes together. Known in the corporate offices as “The Newspaper,” the report consists of a series of dashboards with data and visualizations that show all the leading business indicators and developments from the previous day. It shows daily sales by department and channel (physical stores, e-commerce) plus average receipts, margins, inventory levels, Net Promoter Score (NPS), supply chain updates, and much more. Like a real newspaper, the report relies on information from hundreds of sources and must be produced within strict service level agreements (SLAs). After stores close, we have seven hours to gather, process, and assimilate this data and deliver it to executives before their workday begins the next morning.

Various structured and unstructured data from point of sale (POS), customer relationship management (CRM), inventory, shipping, HR, and other systems is loaded into our SAP Business Warehouse. We use the data lake to produce 18 different reports that are customizable to different business operations and individuals. The process involves our enterprise resource planning (ERP) and all the other systems previously referenced.

Control-M plays the crucial role of being the overall orchestrator. Just for the file transfers to the data lake, Control-M executes 92 complex workflows that require integrations with 12 separate systems. Control-M’s out-of-the-box integrations with Amazon S3, Azure Databricks, Informatica Cloud, and SAP have been integral, as have its connection profiles, which allow us to easily build integrations to other environments. We take advantage of the integration with Airflow to orchestrate our data pipelines, enabling our development and operations teams to use the tools they know best, with Control-M handling the orchestration. Control-M is highly scalable and ensures Airflow jobs run reliably in production.

Control-M doesn’t only connect all the pieces in our new environment; it also continually monitors the workflows running across them to ensure we have no interruptions. We recently created a centralized enterprise monitoring center with integration between Control-M and our ITSM system at the core. As part of that process, we used Control-M to consolidate activities, thereby eliminating more than 200 recurring jobs. Control-M SLA Management proactively identifies potential workflow execution delays and initiates fixes or notifications. We built a feature that automatically issues a notification via WhatsApp to the appropriate business and operations staff if there is a potential issue with their critical jobs. Our environment is much more complex than it used to be, but we are more responsive and data-driven than ever.

These are some of the successes we’ve achieved in the first year of our transformation program. There’s much more we can do, and now we know Control-M will continue to support us as our systems modernize and our business evolves.

For more information on Control-M, visit bmc.com/controlm.

Generating Real Value from Data Requires Real Investment

Scan any business or tech headline right now, and you’re likely to see artificial intelligence (AI) and machine learning (ML), and more specifically the rising niches of GPT and LLMs (generative pre-trained transformer and large language models). GPT and LLMs distill data and return content in natural language, whether as longform narrative, auto-populated answers to questions, or even imagery or videos, all at super-fast speeds. While there’s still much to sort through on what these technologies mean for business, tech, politics, ethics, and more, one thing is clear—they’re breaking new ground for data.

AI, GPT, and LLMs live and die by data. They analyze it, learn from it, and create it, both leveraging and adding to its already explosive growth. And right now, businesses are generating, accumulating, and retaining mountains of it—and spending a considerable amount of money to do so. But to what end?

According to IDC, “Despite spending $290 billion globally on technology and services to increase decision velocity, 42 percent of enterprises report that data is underutilized in their organizations,” and a recent IDC Global DataSphere forecast predicts that by 2026, seven petabytes of data will be created every second. Boston Consulting Group says the comprehensive costs around data are equally staggering, as “spending on data-related software, services, and hardware—which already amounts to about half a trillion dollars globally—is expected to double in the next five years.”

It’s time to put all that juicy data you’ve collected to work, and investing in AI technologies can help you get there. While GPT and LLM solutions are gaining a reputation for what they can create, they’re also being put to work in DataOps practices and analytics solutions that can help you make sense of all that data in the first place. Today’s data is so complex that organizations cannot unravel it without the power of AI.

As I covered in my previous blog, DataOps is all about getting your arms around your data by improving data quality, gaining better business insights, and expanding innovation and cloud efficiency. AI and AI-derived technologies can help on all three fronts.

AI can be used to collate, contextualize, and analyze your hard-won proprietary data and then help you use it to learn about your business and your customers. With AI combing through data, you can uncover new insights that were previously inconceivable even a few years ago—and make informed decisions about which data is no longer needed, still missing, needs more details, and so on. From there, that data can be used to train GPT and LLM tools that advance and expand your business and become the targeted solutions and services your customers crave.

The Eckerson Group recently polled data practitioners on LinkedIn and discovered that 43 percent already use LLMs to assist data engineering. In a second poll, 54 percent said they use ChatGPT to help write documentation, 18 percent use it for designing and building pipelines, and another 18 percent are using it to learn new techniques.

Sitting on a mountain of data gets you nowhere if you don’t know what’s in it. With data accumulations surpassing our capacity to sort through, understand, quantify, and qualify it, investing in AI/ML technologies is the way forward. These technologies can help you dig into all that data and yield valuable insights to better understand your business, discover where to expand or change course, identify new opportunities, and ultimately deliver the solutions your customers and stakeholders want.

Making the most of GPT and LLMs relies on a solid data management foundation enabled by the people, process, and technology shifts of a DataOps strategy and methodology. Learn more about how organizations are yielding value from data in Profitable Outcomes Linked to Data-Driven Maturity, a BMC-commissioned study by 451 Research, part of S&P Global Market Intelligence.

Taking Steps to Unify Data for Maximum Value

Businesses have been on a data collection kick for a while now, and it’s no surprise since IDC says we’ll generate around 221 zettabytes of data by 2026. But if your goal is to turn all that data into insights, where do you start? Do you know what you have? Is it the right data? And, most importantly, is it yielding value for your business?

We commissioned 451 Research, part of S&P Global Market Intelligence, to survey 1,100 IT and data professionals from diverse global regions about what they want from their data, and the challenges they’re facing in achieving those goals. The findings are out now in Profitable Outcomes Linked to Data-Driven Maturity.

The survey revealed a handful of common issues that are impeding progress as businesses try to gather and present a unified view of their data. Among them:

  • Meeting the streaming or real-time requirements needed to support data collection from 24×7 business models and Internet of Things devices
  • Lack of automation, and a reliance on manual processes and legacy solutions
  • Data quality issues with collecting inaccurate and out-of-date information
  • Data silos and lack of system interoperability

Additionally, respondents said they need help determining the usability, trustworthiness, and quality of the information they’ve been gathering—and continue to gather—to maximize and optimize that data. If the data is incomplete or incorrect, an organization loses not only the time and effort required to gather and store it in the first place—it also puts itself at risk of noncompliance issues and strategic missteps that damage the bottom line.

Ensuring that you’re gathering the right data, and putting it to good use, requires a tool that can deliver a unified view. Automated capabilities are key to saving time and toil related to data processing, reducing errors, and delivering real-time visibility anytime from anywhere. BMC’s application workflow orchestration solutions, Control-M and BMC Helix Control-M, can help organizations optimize the data they’ve worked so hard to collect, and yield the most value from it.

Control-M simplifies application and data workflow orchestration on-premises or as a service. It makes it easy to build, define, schedule, manage, and monitor production workflows, ensuring visibility and reliability and improving service level agreements (SLAs). BMC Helix Control-M is a software-as-a-service (SaaS)-based solution that integrates, automates, and orchestrates complex data and application workflows across highly heterogeneous technology environments.

Both solutions support the implementation of DataOps, which applies agile engineering and DevOps best practices to the field of data management to better organize, analyze, and leverage data and unlock business value. With DataOps, DevOps teams, data engineers, data scientists, and analytics teams collaborate to collect and implement data-driven business insights.

Automating and orchestrating data pipelines with tools like Control-M and BMC Helix Control-M is integral to DataOps, and can help you yield value from your data and drive better business outcomes by:

  • Improving data quality: Once guardrails are in place to identify, collate, and analyze data, you’ll get a better sense of the data you have—and what you still need.
  • Gaining better business insights: Now that you’re collecting and analyzing the data you want—and not cluttering it with the data you don’t—it’s an easier task to leverage that information for targeted, revenue-generating activities.
  • Expanding innovation and cloud efficiency: With the cost savings achieved through data orchestration and better data processes, you can redirect spend toward innovation initiatives (informed by those very same data insights) that help grow the business.

You can read the full report, Profitable Outcomes Linked to Data-Driven Maturity, here. Visit bmc.com/controlm to learn more about Control-M and bmc.com/helixcontrolm to learn about BMC Helix Control-M.

Formula One’s Mark Gallagher Talks Data and Insights

In 2022, 5.7 million people attended Formula One races around the world, with revenue growing to $2.573 billion. Those two nuggets of intel about one of the biggest sports in the world are data points, and in our latest BMC Transformational Speaker Series, BMC VP of Sales Jeff Hardy and Oracle Senior Director of ISV Success Dan Grant welcomed Formula One Racing Data Analyst Mark Gallagher for a wide-ranging discussion on how data, analytics, and insights are being used to improve efficiency, safety, and more for the organization’s drivers and vehicles. Here are a few excerpts from the conversation.

Mark shared that the organization’s technology evolution since 1950 has been iterative. “We started off by learning how to make cars go faster. We then embraced aerodynamics and learned how to make aircraft go faster, which is effectively what a Formula One car is today. It’s an inverted jet fighter,” he says. “And really the third suite of tools have been digital. And it’s extraordinary to really reflect on the fact that Formula One’s digital transformation has been taking place for more than half of its history.”

“Now to this day, all teams and particularly the more competitive teams [are] utilizing data and analytics. Formula One’s all about action. And we want insights. We want to go on a journey of knowledge rather than a journey of hope. We don’t want to hope we win. We want to know we’re going to win. And that’s where the actionable insights come from.”

While initial analytics revealed what was going right, Mark says he and his team wanted more, and better, data and insights. “We suddenly started wanting a deeper dive. What’s going to help us go faster? What’s going to help us manage risks better? What’s going to enable us to prevent negative outcomes and drive positive outcomes,” he explains. “And that there is the analytics space. Race car drivers haven’t changed very much over the years, but the ecosystem within which they’re operating is night and day difference, thanks to our data-driven environment.”

Mark acknowledged Formula One has had its share of negative outcomes over its history, with fatal accidents once occurring on the track every year, or even multiple times in a single year, and says the advances they have made through technology mean that younger drivers have never experienced that. “One of the really big changes over the last quarter of a century has been the improvement in our risk management, our ability to use real-time data to spot trends, to analyze failure modes developing, to look at diagnostics in real time and say, ‘Actually, there’s a problem developing,’” he points out.

“When you look at a lot of accidents, they’re caused by a failure, component failure that’s caused by a particular issue arising. In many cases … back in the day, we couldn’t do anything about [that]. But now we can. We can instantaneously, if necessary, call a halt to operations. That doesn’t happen very often, but if we need to, we can call a halt to the whole operation.”

“We can manage the lifecycle of up to 80,000 components that we’re going through on the car through the year. And that means that every single aspect and in total, granular detail is being managed so effectively to ensure that we get optimized performance and that risks are minimized. So, when people ask what’s been the big change for me, there is an enormous change.”

Mark adds that airlines have done the same thing. “We’re not the first industry to do that. Every time we get in an aircraft, we’re getting on board something that is inherently safe because of the culture of examining data, forensically examining data from past events in order to ensure that future outcomes are possible,” he explains. “So again, Formula One has looked at aviation and aerospace and said, ‘That’s the level of engineering we’re going to move to.’ And data-driven tools have been integral to that evolution.”

Mark says that between 1950 and 2000, about 45 percent of the time, Formula One cars simply failed due to mechanical issues, and that’s also now become a thing of the past. “Considering that we pride ourselves as engineering companies, we actually weren’t very good at building robust, reliable technology. Today, Formula One world champions can realistically expect to go through a whole season without suffering a single mechanical or technical failure,” he shares.

“I think Lewis Hamilton … went four and a half years without a single significant technical failure. I’ve never had a road car that’s lasted four and a half years with robust technology. So, in Formula One, that quality, that reliability, the foundation stones of the quality of our engineering and outcomes, that has been made possible by our data-driven environment. We’re no longer hoping everything’s going to be okay. We know what the outcome can be.”

If you’re a fan of movies about race cars, you’re familiar with the big moments where a trainer or a coach clicks the stopwatch to marvel at how fast the car got around the track. Mark says they’re light years beyond that. “When we look at the metrics that we are interested in, it all started with the humble stopwatch, [and us wondering], ‘How can we get from A to B faster than our competitors?’” he says.

“How do we get from where we are now to where we want to be more efficiently than our competition? [Now], we are looking at thousands of parameters. We’re talking about 300 sensors on the car, maybe … 1,200 channels of data. The cars are generating … about ten gigabytes of data per lap and a couple of terabytes of data over the weekend from the car. And in terms of the KPIs …we’ve got a ton of people looking at all the metrics. We know the pressures of the tires, we know the temperatures of the tires, we know everything that the human being driving the car is doing.”

“Most of what you’re looking at is all doing fine, but we are really interested in the opportunities, and this is where the actionable insights kick in, because [when] you get an anomaly … it’s amazing what we can do. And in its most extreme form, let’s say you were leading the race and you had three laps of the race remaining, and an issue develops on a particular system, we can monitor that system. We can talk to the driver and say, ‘Can you modify your driving because this system is beginning to fail,’ or ‘There’s an issue with this system.’ We might even tell the driver to switch off a particular system. We’ve even had occasions where we’ve had Formula One drivers do the equivalent of control, alt, delete on their steering wheel and literally reset everything and it’s cured a problem.”

“In terms of the metrics that we’re interested in, it’s anything that’s going to show us our benchmark performance against the competition and where the opportunities and the anomalies … and the risks are that we can really dig into some detail that’s going to help us to improve performance.”

“This is why when you see Formula One race car drivers being interviewed after a qualifying session or after a race, they very often say, ‘We’ve got to look at the data.’ And what they’re actually saying is, ‘We’ve got to look at where that issue lies [or] where that opportunity lies.’ It’ll always be the thing or the items which are going to lead to us getting a performance improvement. So, our data-driven environment from the driver’s perspective, is all about managing and gaining insights into any metrics that are going to supercharge our continuous improvement as a race team.”

BMC is keenly interested in the world of data and analytics, and helping companies pursue and improve their data strategies. We recently commissioned 451 Research, part of S&P Global Market Intelligence, to survey 1,100 IT and data professionals from diverse global regions, and those findings have just been released in a new report, Profitable Outcomes Linked to Data-Driven Maturity. You can also check out our deep dive into the world of DataOps here.

To learn more about how the need for speed and the need for data is driving the future of Formula One, Mark’s thoughts on artificial intelligence and autonomous vehicles, and how one driver was so thirsty for data he was checking the screens around the track while driving 200 MPH during a race, watch the full conversation.

Leveraging Data to Deliver a Transcendent Customer Experience

Customer satisfaction can make or break your business. So, are you collecting and using relevant data to drive meaningful change for your customers’ interactions with your business? We wanted to find out how companies are using—and maximizing—their data to yield value, so we commissioned 451 Research, part of S&P Global Market Intelligence, to survey 1,100 IT and data professionals from diverse global regions. Those findings have just been released in a new report, Profitable Outcomes Linked to Data-Driven Maturity.

Supporting the customer experience is becoming a key focus in the contemporary use of enterprise data, and strong data practices are integral to delivering a Transcendent Customer Experience, one of the tenets of the Autonomous Digital Enterprise, that meets customers where, when, and how they want to be met, providing customer engagement and satisfaction that lead to long-term business profitability.

Fifty-five percent of survey respondents are focused on improving their customer satisfaction levels through the effective use of data. In an increasingly, pervasively online world, the report asserts that failing to capture and understand the context of data derived from customer interactions via digital channels and data-driven mediums “is akin to leaving money on the table.” Over the next 24 months, one-fifth of survey respondents expect customer satisfaction to be the single area of most significant improvement in their data strategy evolution.

The types of data gathered from and about customers can be used to inform and influence different aspects of their overall experience, empowering businesses to:

  • Personalize offerings tailored to specific customer profiles, preferences, and previous purchases
  • Identify and correct service issues through customer surveys and self-service solutions
  • Forecast and respond to trends, adjusting supply chains to meet customer demand
  • Incentivize customers with loyalty and rewards programs based on engagement and purchases

Seventy percent of those surveyed said they were highly effective or mostly effective at leveraging data-driven insights for customer-facing processes such as onboarding and signups, while 74 percent said they were highly effective or mostly effective using those insights to help ensure customer service (finding products, placing orders, providing delivery status, etc.).

To make data useful to the business, organizations must be able to have a unified view of their data, as well as automated tools and processes to better manage and organize it; verify its quality; analyze its usefulness; and ensure that it flows to the right place at the right time for faster decision making. The right mix of people, processes, and technology is essential to ensure a democratized data culture and develop true data maturity.

To do this, organizations must take a holistic, enterprise-wide view of their data assets and activity, implementing a DataOps methodology that applies agile and automated approaches to data management to support data-driven business outcomes and leverages appropriate supporting technology to optimize business processes and people. DataOps represents a culture and technology shift. Among organizations with a more mature DataOps strategy, 77 percent indicated that their organization’s use of data has had a most significant impact to date on customer satisfaction, versus 65 percent among total respondents.

To learn more about how DataOps and data maturity can help organizations deliver a Transcendent Customer Experience and tangible benefits of data maturity across the business, visit bmc.com/valueofdata.

Diving Deep into All Things Data with Dr. Tricia Wang

We know we’re living in an increasingly data-driven world, but have you ever wondered what all that data says about us? Dr. Tricia Wang has, and BMC CTO Ram Chakravarti and AWS Senior Partner Development Manager Vijaya Balakrishna were honored to recently host a new Transformational Speaker webinar with the global tech ethnographer, researcher, and popular TED Talk speaker to get her take on “thick data”—the human element invisible to quantitative data analysis. Here are some highlights of the conversation.

Looking beyond the quantitative

Dr. Wang kicked off the discussion explaining that what’s most interesting to her is going beyond the basic definition that most people think of when they use the term “data.” “We all know what big data is. It’s numbers that are put in spreadsheets that are put in data lakes, data warehouses, all that stuff…that you can then do math on. But the world is not just numbers. There are other ways to represent the world and to represent the…processes that we’re interested in, especially that businesses are interested in,” she explains.

“If you think about in your everyday life, you don’t make decisions just based on numbers. We make decisions based on a holistic picture, and so numbers are taken into account, [but] you’re also looking at non-quantitative indicators.” She points to people’s reliance on smart watches and Fitbits as monitoring tools while still paying attention to general indicators such as how they actually feel.

She explains that it’s been her life’s work educating business leaders who want to stay ahead on how to yield value from non-quantitative data and leverage it as an indicator to drive actionable insights and make their businesses more agile and customer centric. And she says that while automation is gaining prominence, it’s not the only tool that you should have in your toolbox.

“Any qualified, legit business leader is fully aware that you need to have domain expertise and that domain expertise holds so much thick data,” she says. “There’s so much thick data that’s required to manage relationships…to read[ing] a room [and] knowing how to present a story to get buy in. The higher you go up, you’re really just…trying to convince people to see your point of view and get to some aligned outcome. And that requires thick data. You can’t just convince people based on quantitative numbers.”

“You need to have a human in the loop and really understand the human model, or the landscape that you are trying to understand, [or] the business outcomes. And then you can look at the variables that are going to help you understand that human model [and] build your data model based on the human model.” She adds that she tells her CEOs to stop treating their CIOs and CTOs like technical partners and help them understand their role as cultural change agents for the entire business.

Data in a brave new world

Dr. Wang turned the discussion toward the pandemic, highlighting that it’s had a significant impact not just on people in their personal and professional lives, but businesses, too. “Business ethnography now is about observing customers, their needs and motivations, in their own environment, in their own cultural setting. The whole world has changed and it’s still changing. We live in an incredible time where we talk about digital transformation and businesses leading digital transformation. The reality is front facing digital transformation was not what…it wasn’t really until the pandemic really forced the whole world online,” she says.

“One of the biggest changes is what I call the spatial collapse, where before, for centuries, since the Industrial Revolution, we lived in a way that has been more or less separated by three spaces. Home is the first space, the second space is the workspace, and then the third space is everything you do outside of that, from having fun, seeing strangers, meeting with your friends. And what the pandemic did was collapse all the three spaces into one. That is one of the biggest trends for changing customer behavior and the psychology of it.”

“The mental models of how they build their lives has radically changed. And it doesn’t just mean it’s customer behavioral change. It means employees change. The way we recruit talent changes. People have different expectations when there’s a spatial collapse.” She points out that’s why so many employees bristled at return to office mandates.

A new (siloed) world view

Dr. Wang also points out that in addition to the pandemic, sociopolitical and geopolitical events like the global supply chain shortage and the war in Ukraine have all had ripple effects from “how someone is staying warm over the night, and how they’re getting food” because we are in “a very interconnected world.” And while that connectedness holds true for the real world, people are shifting more toward silos online, which in turn is disrupting how digital business finds its customers.

“We’re going to see more polarization. I’m not anti-social media, but I do think [its] algorithms have been optimized for moving people into polarization and two extreme ends. Having safe spaces for conversation becomes much harder…which means that a lot of people are moving off social media, off of public spaces, into smaller walled gardens like private conversation groups like WhatsApp or Signal,” she explains.

“What this means for businesses, in this era, [is] that you have to do a better job on getting to know your customer. First-party data becomes even more valuable. However, customers or…people are much more aware of the value of their personal data to a [billion-dollar] market, and people want a piece of that. [It] requires trusting relationships [and] the business has to change the way they collect data and communicate data…and show transparently what [they’re] doing with that data.”

A shifting focus

Another effect of the pandemic is that people moved to new cities and towns—and reexamined how they live, elevating larger discussions around environmental, social, and governance (ESG); corporate social responsibility (CSR); and sustainability. People now want much more transparency from businesses about how they are going to influence their lives, contribute to the world overall, and make it a better place—and they’re redefining what “better” means.

“One of the things we’re seeing is the change in supply chain strategy, because people are saying, ‘more local.’ (…) They have a new kind of relationship with those around them, the city or the space around them, and their shops are connected to their local economies in different ways. There’s a lot happening around regionalization of supply chains to keep things more stable. People are much more aware of a disruptive supply chain and really thinking through, ‘Where does my food come from?’ ‘What does my clothing come from?’ ‘What kind of stores or businesses am I supporting?’ Big topic issues that maybe weren’t as top of mind, like the climate and food security, will become even more important.”

As Dr. Wang shared, the concept of data is far-reaching, and automation can help you get a handle on it and yield its maximum value. To learn more about automating your entire big data lifecycle from end to end—and cloud to cloud—to deliver insights more quickly, easily, and reliably, download our e-book.

Orchestrate and Automate to Make DataOps Successful

DataOps is intended to smooth the path to becoming a data-driven enterprise, but some roadblocks remain. This year, according to a new IDC InfoBrief sponsored by BMC, DataOps professionals reported that on average, only 58 percent of the data they need to support analytics and decision making is available. How much better would decision-making be, and how much business value would be created, if the other 42 percent of the data could be factored into decisions as intended? It seems logical to assume it would be almost twice as good!

That raises another question: Why can’t organizations get the data that they already have where they need it, when they need it? In most cases, the answer comes down to complexity.

A previous blog by my colleague, Basil Faruqui, introduced why DataOps is important. This one follows up to highlight what is needed. Spoiler alert: The ability to orchestrate multiple data inputs and outputs is a key requirement.

The need to manage data isn’t new, but the challenges of managing data to meet business needs are changing very fast. Organizations now rely on more data sources than ever before, along with the technology infrastructure to acquire, process, analyze, communicate, and store the data. The complexity of creating, managing, and quality-assuring a single workload increases exponentially as more data sources, data consumers (both applications and people), and destinations (cloud, on-premises, mobile devices, and other endpoints, etc.) are included.

DataOps is helping manage these pathways, but is also proving to have some limitations. The IDC InfoBrief found integration complexity is the leading obstacle to operationalizing and scaling DataOps and data pipeline orchestration. Other obstacles include a lack of internal skills and time to solve data orchestration challenges, and difficulty using the available tooling. That means that for complex workloads like these, organizations can’t fully automate the planning, scheduling, execution, and monitoring because the complexity causes gaps, which in turn cause delays. This results in decisions being made based on incomplete or stale data, thus limiting business value and hampering efforts to become a data-driven enterprise.

Complexity is a big problem. It is also a solvable one. Orchestration, and more specifically, automating orchestration, are essential to reducing complexity and enabling scalability, unlike scripting and other workarounds. Visibility into processes, self-healing capabilities, and user-friendly tools also make complexity manageable. As IDC notes in its InfoBrief, “Using a consistent orchestration platform across applications, analytics, and data pipelines speeds end-to-end business process execution and improves time to completion.”

Some of the most important functionality that is needed to achieve orchestration includes:

  • Built-in connectors and/or integration support for a wide range of data sources and environments
  • Support for an as-code approach so automation can be embedded into the deployment pipelines (a minimal sketch follows this list)
  • Complete workflow visibility across a highly diverse technology stack
  • Native ability to identify problems and remediate them when things go wrong
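
To make the as-code and self-remediation items above concrete, here is a minimal, generic Python sketch of a pipeline step that validates its data and raises an alert before anything flows downstream. It is an illustration of the pattern only, not Control-M syntax, and the webhook URL is a hypothetical placeholder; in Control-M the equivalent behavior is declared through its jobs-as-code interface and built-in integrations.

```python
import json
from urllib.request import Request, urlopen

WEBHOOK_URL = "https://example.com/hooks/data-alerts"  # hypothetical alerting endpoint


def notify(message: str) -> None:
    """Stand-in for a built-in notification: post an alert to a webhook."""
    req = Request(
        WEBHOOK_URL,
        data=json.dumps({"text": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urlopen(req)


def quality_check(rows: list) -> bool:
    """A trivial quality gate: data exists and every row carries an 'id' field."""
    return bool(rows) and all("id" in row for row in rows)


def run_step(name: str, extract, load) -> None:
    """Run one pipeline step as code: extract, validate, then hand off downstream."""
    rows = extract()
    if not quality_check(rows):
        notify(f"{name}: quality check failed; holding downstream workflows")
        raise RuntimeError(f"{name}: bad or missing data")
    load(rows)
```

Because the workflow logic is expressed as code, the same definition can be checked into version control and deployed through the delivery pipeline alongside the application it serves, which is exactly what the as-code requirement is meant to enable.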

Tooling that is specific to a software product, development environment, or hyperscale platform may provide some of that functionality, but typically isn’t comprehensive enough to cover all the systems and sources the workflow will touch. That’s one reason so many DataOps professionals report that tooling complexity hinders their efforts.

Control-M can simplify DataOps because it works across and automates all elements of the data pipeline, including extract, transform, load (ETL), file transfer, and downstream workflows. Control-M is also a great asset for DataOps orchestration because:

  • It eliminates the need to use multiple file transfer systems and schedulers.
  • It automatically manages dependencies across sources and systems and provides automatic quality checks and notifications, which prevents delays from turning into major logjams and job failures further downstream.

Here are a couple of quotes from Control-M users that illustrate its value. A professional at a healthcare company said, “Control-M has also helped to make it easier to create, integrate, and automate data pipelines across on-premises and cloud technologies. It’s due to the ability to orchestrate between workflows that are running in the cloud and workflows that are running on-prem. It gives us the ability to have end-to-end workflows, no matter where they’re running.”

Another user, Railinc, said, “The order in which we bring in data and integrate it is key. If we had to orchestrate the interdependencies without a tool like Control-M, we would have to do a lot of custom work, a lot of managing. Control-M makes sure that the applications have all the data they need.” You can see the full case study here.

These customers are among the many organizations that have reduced the complexity of their DataOps through automation. The IDC InfoBrief compares enterprises that excel at DataOps orchestration to those that don’t and found advantages for the leaders in multiple areas, including compliance, faster decision-making and time-to-innovation, cost savings, and more.

Can you orchestrate similar results at your organization? Learn more about Control-M for Data Pipeline Orchestration here and register for a free trial.

 

Improve SAP® system performance with automated data archiving

Remember when we used to talk about big data? Volume, variety, veracity, and velocity—the metrics were mind boggling. Today we are actually living in that reality. Businesses are generating, collecting, and trying to manage and analyze more data than we ever thought possible. Sales data, check. Market data, check. Internet of Things (IoT) data, systems of record, social media data…check, check, check.

Data is everywhere, and it’s only growing (in size and importance). So much, in fact, that the term “big data” is basically dead. Now it’s just data. Every company is focused on turning all this data into insights. And that’s great; it’s pushing the boundaries of what’s possible. But what do you do when all that data starts aging? Many companies are struggling because they still haven’t developed effective data archiving strategies.

This can be especially true for companies with large SAP® installations. Why? Because, over time, SAP implementations generate tons of data. Often, the data is sitting on production instances, slowing jobs, processes, and development cycles. Explosive data growth in SAP systems causes deterioration of application performance and user productivity, and it generates higher costs due to large tier-1 enterprise storage volumes (with redundancies) and higher administration costs due to long backup/maintenance windows and SAP upgrades.

So, why don’t companies just archive all this data? The simple answer: because it’s not that easy. Here are a few common challenges:

  • Consensus: Stakeholders across the organization often can’t agree on retention policies. Data archiving affects many groups, including IT operations (ITOps), business users, functional and technical SAP teams, and legal and compliance teams, etc.
  • Rules and regulations: There are a lot of rules to consider, including audit guidelines, industry-specific Food and Drug Administration (FDA) regulations, and many more.
  • Future availability: And we can’t forget the fear factor. Everyone asks, “What happens when we need this data and it’s no longer available?”

SAP recommends that companies archive data regularly. But it’s critical to take a comprehensive, automated approach. Built-in and home-grown archiving tools are often limited in scope. For example, SAP includes some of its Information Lifecycle Management (ILM) functionality in its standard NetWeaver technology platform, but a separate ILM license must be purchased to use it for other types of data.

Meanwhile, enterprises already have solutions in place for archiving and file transfers for their non-SAP data, so the separate ILM license could be considered a redundant cost. Many legacy solutions have similar limitations on the data types they can handle and the applications, storage, and other infrastructure components they can work with, which has made multi-tool environments common.

Beyond that, the impending SAP ERP Central Component (ECC) end-of-life and SAP S/4HANA® becoming the favored form of SAP demand a modern approach to data archiving. Fortunately, Control-M can help. Control-M is an SAP-certified solution that creates and manages data archiving jobs for SAP ECC, SAP S/4HANA, and SAP Business Warehouse, and can support any application in the SAP ecosystem. This reduces the time, complexity, and specialized knowledge required. It can also be used for all other enterprise jobs, services, processes, and workflows. That lets organizations using SAP build, orchestrate, run, and manage all their enterprise jobs from a consolidated, integrated platform that provides visibility across all enterprise workflows and their dependencies. The result?

  • SAP system performance and response times improve, which reduces hardware and administrative costs.
  • System availability increases, resulting in less downtime during release upgrades.
  • Employees across the organization get better access to data and documents.
  • Archived data is compressed and stored in archive files, which remain accessible through application reports or the SAP Archive Information System.
  • Companies are better positioned against security threats and for audit and compliance requirements.
  • Data archiving helps organizations accelerate their journey to SAP S/4HANA.

Want to learn more about how Control-M can help your organization streamline data archiving processes? Check out this white paper.

SAP, SAP S/4HANA are the trademark(s) or registered trademark(s) of SAP SE or its affiliates in Germany and in several other countries.

How to orchestrate a data pipeline on Google Cloud with Control-M from BMC https://www.bmc.com/blogs/orchestrate-a-data-pipeline/ Thu, 22 Sep 2022 16:17:22 +0000

The Google Cloud Platform is designed specifically to accommodate organizations in a variety of positions along their cloud services journey, from large-scale machine learning (ML) and data analysis, to services tailored to small and midsize businesses (SMBs), to hybrid-cloud solutions for customers that want to use services from more than one cloud provider. When BMC was migrating our Control-M application to this cloud ecosystem, we had to be very thoughtful about how we managed this change. The SADA engineering team worked alongside the BMC team to ensure that we had a seamless integration for our customers.

SADA supported this project by providing an inventory of the Google Cloud configuration options, decisions, and recommendations to enable the data platform foundation deployment; collaborating with BMC on implementation planning; providing automation templates; and designing the Google Cloud architecture for the relevant managed services on the Google Cloud Platform.

In this article, we will discuss the end result of this work and look at an example that uses a credit card fraud detection process to show how you can use Control-M to orchestrate a data pipeline seamlessly in Google Cloud.

Five orchestration challenges

There are five primary challenges to consider when streamlining the orchestration of an ML data pipeline:

  • Understand the workflow. Examine all dependencies and any decision trees. For example, if data ingestion is successful, then proceed down this path; if it is not successful, proceed down that path.
  • Understand the teams. If multiple teams are involved in the workflow, each needs to have a way to define their workflow using a standard interface, and to be able to merge their workflows to make up the pipeline.
  • Follow standards. Teams should use repeatable standards and conventions when building workflows. This avoids having multiple jobs with identical names. Each step should also have a meaningful description to help clarify its purpose in the event of a failure.
  • Minimize the number of tools required. Use a single tool for visualization and interaction with the pipeline (and dependencies). Visualization is important during the definition stage since it’s hard to manage something that you can’t see. This is even more important when the pipeline is running.
  • Include built-in error handling capabilities in the orchestration engine. It’s important to understand how errors can impact downstream jobs in the workflow or the business service level agreement (SLA). At the same time, the failure of a single job should not halt the entire pipeline or always require human intervention. Criteria can determine whether a failed job is restarted automatically or whether a person must evaluate the failure, for instance when the same error has occurred a certain number of times (see the sketch after this list).
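
To make this concrete, here is a minimal sketch of how a restart-or-notify policy might look in Control-M’s JSON job format, following the If/Mail pattern used in the Defaults example later in this article. The job name and recipient address mirror that example, while the Action:Rerun action type is an assumption for illustration; confirm the exact action names in the Control-M Automation API guide.

"jog-dflow-gcs-to-bq-fraud": {"Type": "Job:Google DataFlow", …,
    "actionIfError": {"Type": "If", "CompletionStatus": "NOTOK",
        "rerunIt": {"Type": "Action:Rerun"},
        "mailTeam": {"Type": "Mail",
            "Message": "Job %%JOBNAME failed and was resubmitted",
            "To": "deng_support@bmc.com"}}
    },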

Meeting the challenge

Meeting these orchestration challenges required a solid foundation and also presented opportunities for collaboration. BMC and SADA aligned using the SADA POWER line of services to establish the data platform foundation. Some notable elements in this technical alignment included work by SADA to:

  • Apply industry expertise to expedite BMC’s development efforts.
  • Establish a best practices baseline around data pipelines and the tools to orchestrate them.
  • Conduct collaborative sessions to understand BMC’s technical needs and provide solutions that the BMC team could integrate and then expand upon.

SADA’s Data Platform Foundation provided opportunities to leverage Google Cloud services to accomplish the complex analytics required of an application like Control-M. The BMC and SADA teams worked together to establish a strong foundation for a robust and resilient solution through:

  • Selecting data and storage locations in Google Cloud Storage.
  • Utilizing the advantages provided by Pub/Sub to streamline the analytics and data integration pipelines.
  • Having thorough discussions around the extract, transform, and load (ETL) processes to truly understand the end state of the data.
  • Using BigQuery and writing analytic queries.
  • Understanding the importance of automation, replicability of processes, and monitoring performance in establishing a system that is scalable and flexible.
  • Using Data Studio to create a visualization dashboard to provide the necessary business insights.

Real-world example

Digital transactions have been increasing steadily for many years, but that trend is now coupled with a permanent decline in the use of cash as people and businesses practice physical distancing. The adoption of digital payments for businesses and consumers has consequently grown at a much higher rate than previously anticipated, leading to increased fraud and operational risks.

With fraudsters improving their techniques, companies are relying on ML to build resilient and efficient fraud detection systems.

Since fraud constantly evolves, detection systems must be able to identify new types of fraud by detecting anomalies that are seen for the first time. Therefore, detecting fraud is a perpetual task that requires constant diligence and innovation.

Common types of financial fraud that customers work to prevent with this application include:

  • Stolen/fake credit card fraud: Transactions made using fake cards, or cards belonging to someone else.
  • ATM fraud: Cash withdrawals using someone else’s card.

Fraud detection is composed of both real-time and batch processes. The real-time process is responsible for denying a transaction and possibly placing a hold on an account or credit card, thus preventing the fraud from occurring. It must respond quickly, sometimes at the cost of reduced accuracy.

To minimize false positives, which may upset or inconvenience customers, a batch phase is used to continuously fine-tune the detection model. After transactions are confirmed as valid or fraudulent, all recent events are input to the batch process on a regular cadence. This batch process then updates the training and scoring of the real-time model to keep real-time detection operating at peak accuracy. This batch process is the focus of this article.

Use our demo system

SADA and BMC created a demonstration version of our solution so you can experiment with it on Google Cloud. You can find all of our code, plus sample data, in GitHub.

Resources included are:

  • Kaggle datasets of transaction data, fraud status, and demographics
  • Queries
  • Schema
  • User-defined functions (UDFs)

How it works

For each region in which the organization operates, transaction data is collected daily. Details collected include (but are not limited to):

  • Transaction details. Describes each transaction, including the amount, item code, location, method of payment, and so on.
  • Personal details. Describes the name, address, age, and other details about the purchaser.

This information is pulled from corporate data keyed on credit card information, together with the output of the real-time fraud detection system, which identifies the transactions that were flagged as fraudulent.

New data either arrives as batch feeds or is dropped into Cloud Storage by Pub/Sub. This new data is then loaded into BigQuery by Dataflow jobs. Normalization and some data enrichment are performed by UDFs during the load process.

Once all the data preparation is complete, analytics are run against the combined new and historical data to test and rank fraud detection performance. The results are displayed in Data Studio dashboards.

Figure 1: Control-M orchestration

Google Cloud services in the pipeline

Cloud Storage provides a common landing zone for all incoming data and a consistent input for downstream processing. Dataflow is Google Cloud’s primary data integration tool.

SADA and BMC selected BigQuery for data processing. Earlier versions of this application used Hadoop, but while working with the team at SADA, we converted to BigQuery, which is Google’s recommended approach for sophisticated data warehouse and data lake applications. This choice also simplified setup by providing out-of-the-box integration with Cloud Dataflow. UDFs provide a simple mechanism for manipulating data during the load process.

Two ways to define pipeline workflows

You can use Control-M to define your workflow in two ways:

  • Using a graphical editor. This provides the option of dragging and dropping the workflow steps into a workspace and connecting them.
  • Using RESTful APIs. Define the workflows using a jobs-as-code method in JSON, then integrate them into a continuous integration/continuous delivery (CI/CD) toolchain. This method improves workflow management by flowing jobs through a pipeline of automated building, testing, and release. Google Cloud provides a number of developer tools for CI/CD, including Cloud Build and Cloud Deploy.

Defining jobs in the pipeline

The basic Control-M execution unit is referred to as a job. There are a number of attributes for every job, defined in JSON:

  • Job type. Options include script, command, file transfer, Dataflow, or BigQuery.
  • Run location. For instance, which host is running the job?
  • Identity. For example, is the job being “run as…” or run using a connection profile?
  • Schedule. Determines when to run the job and identifies relevant scheduling criteria.
  • Dependencies. This could be things like whether the job must finish by a certain time or output must arrive by a certain time or date.

Jobs are stored in folders; attributes and other instructions defined at the folder level apply to all jobs in that folder.

The code sample below shows the JSON that describes the workflow used in the fraud detection model ranking application. You can find the full JSON, along with other solutions, the Control-M Automation API guide, and additional code samples, in the Control-M Automation API Community Solutions GitHub repo.

{
"Defaults" : {
},
"jog-mc-gcp-fraud-detection": {"Type": "Folder",
  "Comment" : "Update fraud history, run, train and score models",
  "jog-gcs-download" : {"Type" : "Job:FileTransfer",…},
  "jog-dflow-gcs-to-bq-fraud": {"Type": "Job:Google DataFlow",…},
  "jog-dflow-gcs-to-bq-transactions": {"Type": "Job:Google DataFlow",…},
  "jog-dflow-gcs-to-bq-personal": {"Type": "Job:Google DataFlow",…},
  "jog-mc-bq-query": {"Type": "Job:Database:EmbeddedQuery", …},
  "jog-mc-fm-service": {"Type": "Job:SLAManagement",…},
  "flow00": {"Type":"Flow", "Sequence":[
    "jog-gcs-download",
    "jog-dflow-gcs-to-bq-fraud",
    "jog-mc-bq-query",
    "jog-mc-fm-service"]},
  "flow01": {"Type":"Flow", "Sequence":[
    "jog-gcs-download",
    "jog-dflow-gcs-to-bq-transactions",
    "jog-mc-bq-query", "jog-mc-fm-service"]},
  "flow02": {"Type":"Flow", "Sequence":[
    "jog-gcs-download",
    "jog-dflow-gcs-to-bq-personal",
    "jog-mc-bq-query",
    "jog-mc-fm-service"]}
}
}

The jobs shown in this workflow correspond directly with the steps illustrated previously in Figure 1.

The workflow contains three fundamental sections:

  • Defaults. These are settings that apply to the whole workflow. They can include details such as who to contact for job failures or standards for job naming and structure.
{  "Defaults" : {"RunAs" : "ctmagent", "OrderMethod": "Manual", "Application" : 
       "multicloud", "SubApplication" : "jog-mc-fraud-modeling", 
      "Job" : {"SemQR": { "Type": "Resource:Semaphore", Quantity": "1"},
      "actionIfError" : {"Type": "If", "CompletionStatus":"NOTOK", "mailTeam": 
          {"Type": "Mail", "Message": "Job %%JOBNAME failed", "Subject": 
                 "Error occurred", "To": deng_support@bmc.com}}}
    }, 

  • Job definitions. This is where individual jobs are specified and listed. See below for descriptions of each job in the flow.
  • Flow statements. These define the relationships between jobs, both upstream and downstream.
"flow00": {"Type":"Flow", "Sequence":["jog-gcs-download", 
           "jog-dflow-gcs-to-bq-fraud", "jog-mc-bq-query", 
           "jog-mc-fm-service"]},
"flow01": {"Type":"Flow", "Sequence":["jog-gcs-download", 
           "jog-dflow-gcs-to-bq-transactions", 
           "jog-mc-bq-query", "jog-mc-fm-service"]},
"flow02": {"Type":"Flow", "Sequence":["jog-gcs-download", 
           "jog-dflow-gcs-to-bq-personal", "jog-mc-bq-query", 
           "jog-mc-fm-service"]} 

Scheduling pipeline workflows

Control-M uses a server-and-agent model. The server is the central engine that manages workflow scheduling and submission to agents, which are lightweight workers. In the demo described in this article, the Control-M server and agent are both running on Google Compute Engine VM instances.

Workflows are most commonly launched in response to events such as data arrival, but they may also be executed automatically on a predefined schedule. Schedules are very flexible: they can refer to business calendars; specify different days of the week, month, or quarter; define cyclic execution, which runs workflows intermittently or every "n" hours or minutes; and so on.
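
For instance, a simple time-based schedule can be expressed directly in the folder or job JSON with a When object. The sketch below is illustrative only; the weekday and time-window values are placeholders, and the full set of scheduling attributes is documented in the Control-M Automation API guide.

"jog-mc-gcp-fraud-detection": {"Type": "Folder", …,
    "When": {
        "WeekDays": ["MON", "TUE", "WED", "THU", "FRI"],
        "FromTime": "0200",
        "ToTime": "0600"
    }
},

An event-driven launch, such as triggering on data arrival, would replace or complement this block; the file-watching transfer job described below is an example of that pattern.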

Processing the data

File Transfer job type

Looking at the first job, jog-gcs-download (shown in the code sample below), we can see that this job, of the type Job:FileTransfer, transfers files from a conventional file system described by ConnectionProfileSrc to Google Cloud Storage described by ConnectionProfileDest.

The File Transfer job type can watch for data-related events (file watching) as a prerequisite for data transfer, as well as perform pre/post actions such as deletion of the source after a successful transfer, renaming, source and destination comparison, and restart from the point of failure in the event of an interruption. In the example, this job moves several files from a Linux® host and drops them into Google Cloud Storage buckets.

"jog-gcs-download" : {"Type" : "Job:FileTransfer",
        "Host" : "ftpagents",
        "ConnectionProfileSrc" : "smprodMFT",
        "ConnectionProfileDest" : "joggcp",
        "S3BucketName" : "prj1968-bmc-data-platform-foundation",
        "Description" : "First data ingest that triggers downstream applications",
        "FileTransfers" : [
          {
            "TransferType" : "Binary",
            "TransferOption" : "SrcToDestFileWatcher",
            "Src" : "/bmc_personal_details.csv",
            "Dest" : "/bmc_personal_details.csv"
          },
          {
            "TransferType" : "Binary",
            "TransferOption" : "SrcToDestFileWatcher",
            "Src" : "/bmc_fraud_details.csv",
            "Dest" : "/bmc_fraud_details.csv"
          },
          {
            "TransferType" : "Binary",
            "TransferOption" : "SrcToDestFileWatcher",
            "Src" : "/bmc_transaction_details.csv",
            "Dest" : "/bmc_transaction_details.csv"
          } 
        ]
      }, 

Dataflow

Dataflow jobs are executed to push the newly arrived data into BigQuery. The job definitions look complex, but Google Cloud provides an easy way to generate them.

Go to the Dataflow Jobs page (Figure 2). If you have an existing job, choose to Clone it or Create Job from Template. Once you’ve provided the desired parameters, click on Equivalent REST at the bottom to get this information (Figure 3), which you can cut and paste directly into the job’s Parameters section.

Figure 2: Dataflow Jobs page

Figure 3: Cut and paste into job Parameters section

"jog-dflow-gcs-to-bq-fraud": {"Type": "Job:ApplicationIntegrator:AI Google DataFlow",
        "AI-Location": "us-central1",
        "AI-Parameters (JSON Format)": "{\"jobName\": \"jog-dflow-gcs-to-bq-fraud\",
        \"environment\": {        \"bypassTempDirValidation\": false,
        \"tempLocation\": \"gs://prj1968-bmc-data-platform-foundation/bmc_fraud_details/temp\",
        \"ipConfiguration\": \"WORKER_IP_UNSPECIFIED\",
        \"additionalExperiments\": []    },    
        \"parameters\": {
        \"javascriptTextTransformGcsPath\": \"gs://prj1968-bmc-data-platform-foundation/bmc_fraud_details/bmc_fraud_details_transform.js\", 
        \"JSONPath\": \"gs://prj1968-bmc-data-platform-foundation/bmc_fraud_details/bmc_fraud_details_schema.json\",
        \"javascriptTextTransformFunctionName\": \"transform\",
        \"outputTable\": \"sso-gcp-dba-ctm4-pub-cc10274:bmc_dataplatform_foundation.bmc_fraud_details_V2\",
        \"inputFilePattern\": \"gs://prj1968-bmc-data-platform-foundation/bmc_fraud_details/bmc_fraud_details.csv\", 
        \"bigQueryLoadingTemporaryDirectory\": \"gs://prj1968-bmc-data-platform-foundation/bmc_fraud_details/tmpbq\"    }}",
        "AI-Log Level": "INFO",
        "AI-Template Location (gs://)": "gs://dataflow-templates-us-central1/latest/GCS_Text_to_BigQuery",
        "AI-Project ID": "sso-gcp-dba-ctm4-pub-cc10274",
        "AI-Template Type": "Classic Template",
        "ConnectionProfile": "JOG-DFLOW-MIDENTITY",
        "Host": "gcpagents"
      }, 
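
BigQuery query

The analytics step in the workflow, jog-mc-bq-query, runs as a database job with an embedded query against BigQuery (its type, Job:Database:EmbeddedQuery, appears in the workflow JSON above). Below is a minimal, hypothetical sketch of such a definition: the connection profile name, host, query text, and table name are placeholders, and the exact attribute names for embedded-query jobs should be confirmed in the Control-M Automation API code reference.

"jog-mc-bq-query": {"Type": "Job:Database:EmbeddedQuery",
        "ConnectionProfile": "JOG-BQ-CONN",
        "Host": "gcpagents",
        "Query": "SELECT model_id, score FROM bmc_dataplatform_foundation.model_rankings ORDER BY score DESC",
        "Description": "Test and rank fraud detection models against combined new and historical data"
      },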

SLA management

This job defines the SLA completion criteria and instructs Control-M to monitor the entire workflow as a single business entity.

"jog-mc-fm-service": {"Type": "Job:SLAManagement",
	 "ServiceName": "Model testing and scoring for fraud detection",
	 "ServicePriority": "3",
	 "JobRunsDeviationsTolerance": "3",
	 "CompleteIn": {
	    "Time": "20:00"
	  }
	},

The ServiceName specifies a business-relevant name that appears in notifications, service incidents, and displays for non-technical users, making it clear which business service may be impacted. It is important to note that Control-M uses statistics collected from previous executions to automatically compute the expected completion time, so any deviation can be detected and reported at the earliest possible moment. This gives monitoring teams the maximum opportunity to course-correct before business services are affected.

Examining the state of the pipeline

Now that you have an idea of how jobs are defined, let’s take a look at what the pipeline looks like when it’s running.

Control-M provides a user interface for monitoring workflows (Figure 4). In the screenshot below, the first job has completed successfully and is shown in green; the next three jobs are executing and are shown in yellow. Jobs that are waiting to run are shown in gray.

Figure 4: Control-M Monitoring Domain

You can access the output and logs of every job from the pane on the right-hand side. This capability is vital during daily operations. To monitor those operations more easily, Control-M provides a single pane to view the output of jobs running on disparate systems without having to connect to each application’s console.

Control-M also allows you to perform several actions on the jobs in the pipeline, such as hold, rerun, and kill. You sometimes need to perform these actions when troubleshooting a failure or skipping a job, for example.

All of the functions discussed here are also available from a REST-based API or a CLI.

Conclusion

In spite of the rich set of ML tools that Google Cloud provides, coordinating and monitoring workflows across an ML pipeline remains a complex task.

Anytime you need to orchestrate a business process that combines file transfers, applications, data sources, or infrastructure, Control-M can simplify your workflow orchestration. It integrates, automates, and orchestrates application workflows whether on-premises, on the Google Cloud, or in a hybrid environment.
