I’ve been around data (and now big data) for the last 20 years, working at companies like Apple, GoPro, Roku and Malwarebytes, and one thing I’ve learned is that we’re all on a big data journey. In my current role at Malwarebytes, I lead the Data and Artificial Intelligence team. Malwarebytes is 100% focused on creating the best disinfection and protection solutions to combat the world’s most harmful internet threats. My team’s mission is to use big data technologies to harness the vast amounts of data we process and deliver key insights that give us sustainable competitive advantages. More on that below!
Malwarebytes recently celebrated its 10-year anniversary. During these ten years, we’ve built a relationship of trust with our customers, which include consumers as well as large enterprises. We’ve been solving some of the toughest challenges for our customers, such as bailing out infected endpoints when all else had failed. Over the years this has allowed us to develop some of the industry’s most comprehensive endpoint protection. We call it Multi-Vector Protection (MVP). We collect billions of records each day on a millisecond-by-millisecond basis. We use this data to identify, profile and provide protection against the world’s most harmful threats, which are emerging every second.
If you come from the traditional side of data, you’re used to structured data that is high in information density (think Excel). With big data, you’re flooded with data that is very low in information density (think computer logs), and it’s hard to consume.
In my first big data attempt at a previous company, we tried to drop Hadoop right in the middle of our data warehouse. But we failed to see the challenges caused by moving structured data to unstructured files, processing it, and then putting it back in a structured format for consumption. We had lots of job failures. After six months of trying really hard, we decided to abandon that strategy and find a better way.
Next we tried creating parallel data paths. We kept the existing flow of data into a warehouse, but created a separate Hadoop infrastructure to ingest logs, micro-transactions, etc., and brought everything together with a data mart. This worked, but it was expensive because we were paying for two separate infrastructures. And to get all the data to fit in the data marts, it had to be aggregated to the extent that much of the detail was lost. We also had major issues with job execution and orchestration between clusters.
But when I joined Malwarebytes, I had the unique opportunity to build a big data platform from the ground up, without the shackles of legacy systems. We started our big data journey by establishing two critical pillars around which everything else would be built – infrastructure and workload automation.
Because we needed a way to build solutions that could be deployed quickly and speed time-to-market, our clear choice was to use a cloud-based infrastructure. Amazon’s AWS leads the way with the unique capabilities and cutting-edge technologies it provides. Amazon is solving real-world problems the right way, so it was a natural fit for us.
It was also critical that we had a world-class orchestration platform that could empower our engineers to focus on solving our big data challenges. We didn’t want them to spend time figuring out job failures or struggle with job scheduling and orchestration logic. So we chose Control-M, a time-tested orchestration platform, to underpin all of our application workflow orchestration and business critical SLA management.
What does it look like in action?
We perform sophisticated upstream processing using Apache Kafka, Kafka Streams, and Redis Enterprise to prepare data for real-time and batch processing. Some of the data is used for real-time dashboards, and all of the data is pushed into a centralized batch layer where processing is managed by Control-M. It’s then transformed by AWS Glue, a serverless ETL framework, so that business users can consume it through web applications and data scientists can use it for machine learning and AI. We leverage real-time streaming and web sockets on SocketCluster and Node.js to display a real-time map that shows the detection (and remediation) of malware around the world. Control-M and Redis Enterprise are key components for transforming the data and building global infection maps of every type of malware that we find. We have built proprietary technology that allows us to see where infections start and how they spread around the world, from Patient 0 to the current global infection landscape.
The image below is a representation of our architecture and I will briefly describe how this fits together.
Data from endpoints is streamed and collected in Kafka, where it goes through some enrichment. It then lands in our data lake in Amazon S3. Control-M orchestrates the AWS Glue ETL transformation for data cleaning, enrichment, optimization and aggregation. Control-M also orchestrates transformation and web caching in Redis Enterprise and Amazon Aurora. A big advantage of using AWS Glue is the Glue Data Catalog, which allows us to share common data definitions (on big data files) across AWS Glue ETL, SQL operations in Amazon Athena, and EDW operations in Amazon Redshift. Redis Enterprise allows us to manage all of our stateful databases and the shared web cache for infection maps in one high-availability cluster. The next steps are to refresh the Tableau dashboards, publish data to our EDW (for Looker to consume) and then retrain our AI models on Amazon SageMaker. Control-M triggers the Tableau dashboard refreshes, which are based on the data we have in Amazon Athena, and triggers the retraining of the models that we have in Amazon SageMaker for machine learning and predictions. Keeping models trained is a key activity in machine learning, and Control-M removes the worry of models going stale because it knows exactly when the underlying data features are ready. We use Amazon Redshift Spectrum to let Amazon Redshift read data directly from S3, thereby eliminating the need to copy large amounts of staging data into Redshift. All of the subsequent ETL code that builds data marts in Redshift for Looker to consume is orchestrated by Control-M.
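To make the ordering above concrete, here is a minimal sketch in plain Python of the flow as a dependency graph. This is not how Control-M represents jobs. The step names are hypothetical, and the assumption that every downstream step waits only on the Glue ETL step is a simplification made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical step names. Each step maps to the set of steps it waits for.
# Assumption: all downstream consumers depend directly on the Glue ETL step.
PIPELINE = {
    "kafka_enrichment": set(),
    "land_in_s3": {"kafka_enrichment"},
    "glue_etl": {"land_in_s3"},
    "redis_aurora_cache": {"glue_etl"},
    "refresh_tableau": {"glue_etl"},
    "build_redshift_marts": {"glue_etl"},
    "retrain_sagemaker_models": {"glue_etl"},
}


def execution_order(pipeline):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(pipeline).static_order())


def ready_steps(pipeline, finished):
    """Return the steps not yet run whose dependencies have all finished."""
    return sorted(
        step
        for step, deps in pipeline.items()
        if step not in finished and deps <= finished
    )


if __name__ == "__main__":
    print(execution_order(PIPELINE))
    print(ready_steps(PIPELINE, {"kafka_enrichment", "land_in_s3", "glue_etl"}))
```

Once the Glue step finishes, the four downstream steps become ready together, which is the kind of fan-out an orchestrator can run in parallel.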
The various presentation layers are where end users get to see the insights produced by all the complex backend processing of data. Our end users depend on this data to make decisions every day, so it is business critical for us to deliver this data on time, every time. Control-M manages the end-to-end orchestration. It also ensures that we meet our SLAs by monitoring failures and delays in the data pipeline and then providing business context on what impact these delays or failures have on our SLAs.
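The idea of tying a pipeline delay to its SLA impact can be illustrated with a few lines of Python. This is only a sketch of the arithmetic, not Control-M's method; the function names and the use of estimated remaining durations are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def projected_finish(now, remaining_durations):
    """Estimate the finish time by adding the remaining step durations to now."""
    return now + sum(remaining_durations, timedelta())


def sla_status(now, remaining_durations, deadline):
    """Report the projected finish, the slack, and whether the SLA is at risk."""
    finish = projected_finish(now, remaining_durations)
    slack = deadline - finish  # negative slack means a projected SLA miss
    return {
        "projected_finish": finish,
        "slack": slack,
        "at_risk": slack < timedelta(0),
    }


if __name__ == "__main__":
    now = datetime(2020, 1, 1, 5, 0)
    remaining = [timedelta(minutes=30), timedelta(minutes=45)]
    print(sla_status(now, remaining, deadline=datetime(2020, 1, 1, 6, 0)))
```

In this example the remaining steps need 75 minutes but only 60 are left before the deadline, so the dashboard delivery would be flagged as at risk before the deadline actually passes.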
Continuous Integration and Continuous Delivery
We can’t manage this complex operation and orchestration at the needed speed and scale without using continuous integration and continuous delivery. Control-M is a fundamental component of our architecture, providing the orchestration backbone for long-running and repetitive processes, so it was highly desirable to apply the same CI/CD process used for the business logic written in Java, Scala and Python to Control-M’s orchestration logic written in JSON.
We use GitHub for our code repository and Jenkins for Continuous Integration (CI). When changes are committed to the main development branch, a Jenkins build is triggered and the program and job code are built, tested and deployed in unison. For JSON, Control-M provides RESTful services for validating syntax as well as for running tests to verify proper operation. All code, both job code (JSON) and program code (Java/Scala/Python), has automated test scripts so that proper regression testing is done.
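As an illustration of the kind of check a CI stage can run before handing job definitions to Control-M’s validation service, here is a minimal local lint in Python. It does not call Control-M and does not implement its schema. The rule that every top-level entry must be an object carrying a "Type" key is a hypothetical convention chosen for this sketch, and `lint_job_definitions` is an illustrative name.

```python
import json


def lint_job_definitions(text):
    """Return a list of problems found in a job-definition JSON document."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]

    problems = []
    for name, body in doc.items():
        # Hypothetical rule for this sketch: each entry declares its "Type".
        if not isinstance(body, dict):
            problems.append(f"{name}: expected an object")
        elif "Type" not in body:
            problems.append(f"{name}: missing 'Type'")
    return problems


if __name__ == "__main__":
    print(lint_job_definitions('{"DailyEtl": {"Type": "Folder"}}'))
    print(lint_job_definitions('{"DailyEtl": {}}'))
```

A Jenkins stage could fail the build whenever the returned list is non-empty, so malformed job code never reaches deployment alongside the program code.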
Today (over a year later), we’ve continued to evolve. We’ve implemented a Lambda architecture, combining real-time streamed processing with batch analytics – all underpinned by Control-M and AWS. And everything is married together with web applications, dashboards, data science and machine learning.
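For readers new to the Lambda architecture, the core idea is that a query combines a complete but slightly stale batch view with a small real-time view covering events since the last batch run. The sketch below shows that merge for per-country detection counts. The data shapes and the `merge_views` name are assumptions for illustration, not our production code.

```python
from collections import Counter


def merge_views(batch_view, speed_view):
    """Combine batch-layer counts with counts seen since the last batch run."""
    merged = Counter(batch_view)
    merged.update(speed_view)  # Counter.update adds counts per key
    return dict(merged)


if __name__ == "__main__":
    batch = {"US": 100, "DE": 40}  # detections up to the last batch run
    speed = {"US": 3, "BR": 1}     # detections streamed in since then
    print(merge_views(batch, speed))
```

When the next batch run completes, the batch view absorbs those recent events and the real-time view starts again from empty.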
Control-M and AWS have been key to helping us lead the way in threat detection analysis, and have helped position us at the cutting edge of our industry. We now consider data a strategic advantage we have over our competitors and, more importantly, the bad guys.