Ajoy Kumar – BMC Software | Blogs

Architectural Approach for Building Generative AI Applications
https://s7280.pcdn.co/architectural-approach-for-building-generative-ai-applications/ | Fri, 02 Feb 2024

This is the second blog in my series, following “Requirements for Building an Enterprise Generative AI Strategy,” where I highlighted the significant challenges and expectations enterprise customers have for generative AI, along with detailed requirements for building a strategy. My recommendations centered on grounding responses in enterprise knowledge, integrating references for trust and verifiability, ensuring answers respect user access controls, and providing model flexibility.

In this blog, I introduce a reference architecture designed specifically for generative AI applications, demonstrate how this architecture addresses enterprise challenges around trust, data privacy, security, and large language model (LLM) agility, and provide a brief overview of LLM operations (LLMOps). As a refresher, BMC HelixGPT is our approach to generative AI, integrated across the BMC Helix for ServiceOps platform.

Reference architecture for generative AI applications

An application architecture describes the behavior of applications used in a business, focusing on how they interact with each other and with users, and on the data they consume and produce rather than their internal structure. The industry recognizes three prominent AI design patterns for building generative AI applications:

  • Prompt engineering
  • Retrieval augmented generation (RAG)
  • Fine-tuning pipelines

Instead of debating which approach is better, BMC HelixGPT seamlessly integrates all three.

The diagram below shows our BMC HelixGPT application reference architecture for generative AI. The architecture consists of several layers: API plug-ins, a prompt library, vector data source ingestion, access control processing, a model-training pipeline, an assessment layer for hallucination detection, telemetry, and evaluations, a “bring your own model” embedding layer, and an LLM orchestration layer. BMC HelixGPT extensively uses LangChain as the engine to orchestrate and trigger LLM chains.

Reference Architecture for GenAI

The BMC HelixGPT proprietary generative AI technology, combined with open-source models via LangChain, provides “bring your own model” flexibility for our customers. There are also retrieval plug-ins, access control plug-ins, and API plug-ins that integrate into enterprise systems. Like the holistic design explained in this June 2023 blog by Andreessen Horowitz, we have three main flows:

  1. Data ingestion and training flow: Data is read from multiple data stores, preprocessed, chunked, and then either embedded (for RAG) or run through the training pipeline (for fine-tuning). VectorDB stores the chunked document embeddings, enabling semantic, similarity-based retrieval. (A minimal sketch of flows 1 and 2 appears after this list.)
  2. Prompt augmentation using data retrieval: Once a user query arrives at the API layer, a prompt is selected, and data retrievals through VectorDB or API plug-ins gather the right contextual data before the prompt is passed to the LLM layer.
  3. LLM inference: Here there is a choice between general-purpose foundation models from OpenAI or Azure and the self-hosted foundation model in BMC HelixGPT. Fine-tuned models are used when tuned for a specific task or use case. The response is evaluated for accuracy and other metrics, including hallucinations.
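
To make flows 1 and 2 concrete, here is a minimal sketch using open-source LangChain primitives. The module paths, the FAISS vector store, and the sample knowledge article are assumptions for illustration (LangChain module layouts vary between releases) and do not represent the BMC HelixGPT implementation.

```python
# Minimal sketch of flows 1 and 2: ingest documents into a vector store,
# then retrieve context to augment a prompt. Assumes classic LangChain
# module paths and an OpenAI-compatible embedding endpoint.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.schema import Document

# 1. Data ingestion: chunk enterprise articles and embed them into a vector DB.
articles = [Document(page_content="To fix a VPN connection issue, restart the client ...",
                     metadata={"source": "KB-1042"})]
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(articles)
vectordb = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 2. Prompt augmentation: retrieve the most relevant chunks for a user query
# and place them in the prompt context before calling the LLM.
query = "How do I fix a VPN connection issue?"
context = "\n\n".join(d.page_content for d in vectordb.similarity_search(query, k=3))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```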

Now, let us look at how this reference architecture addresses the challenges of generative AI for enterprises and facilitates the rapid development of generative AI applications.

Overcoming common enterprise challenges with generative AI deployments

Enterprise versus world knowledge: accuracy and trust

Enterprises seek answers across diverse internal and external enterprise data sources such as articles, web pages, how-to guides, and more. Further, data can live in both unstructured and structured stores. BMC HelixGPT ingests, chunks, and embeds these sources through LangChain data loaders using embedding transformer-based models; LangChain provides a rich set of document loaders for this purpose. When a user question is received, we augment the prompts with document retrievals from VectorDB or APIs and use the LLM’s in-context learning to generate a response. This method anchors the LLM’s response to the retrieved documents or data, reducing the risk of hallucinations. BMC HelixGPT also provides the retrieved documents as citations, allowing users to verify the responses. To realize this capability, our strategy integrates various LangChain features, such as retrieval QA chains with sources and conversation history chains.
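
As an illustration of this pattern, here is a hedged sketch of a retrieval QA chain that returns source citations, reusing the `vectordb` built in the earlier sketch. The chain and model class names follow older open-source LangChain releases and are assumptions for the sketch, not the BMC HelixGPT implementation.

```python
# Sketch: ground answers in retrieved documents and surface their sources as
# citations. Reuses the `vectordb` built in the ingestion sketch above.
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain

qa = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-4", temperature=0),
    chain_type="stuff",                                    # stuff retrieved chunks into the prompt
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
)
result = qa({"question": "How do I fix a VPN connection issue?"})
print(result["answer"])    # answer grounded in the retrieved chunks
print(result["sources"])   # citation(s) drawn from document metadata, e.g. "KB-1042"
```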

Access control, security, and data privacy

During document retrieval, BMC HelixGPT validates that the user has permission to read the retrieved documents and removes from the prompt context any documents the user cannot access. This ensures that LLM-generated answers are always based only on documents the user has read access to. As a result, the same question can generate two different answers, each aligned to the user’s role and permission model.
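
The sketch below shows one way such filtering could sit between retrieval and prompt assembly. The `user_can_read` and `get_user_role` functions and the `allowed_roles` metadata field are hypothetical stand-ins for an enterprise identity/ACL service; this is not the BMC HelixGPT implementation. It reuses the `vectordb` from the ingestion sketch.

```python
# Sketch: enforce read permissions on retrieved documents before they reach
# the prompt context. The permission helpers below are hypothetical stubs.
def get_user_role(user_id: str) -> str:
    # Hypothetical stand-in: in practice this queries your IAM / HR system.
    return "employee"

def user_can_read(user_id: str, doc) -> bool:
    allowed = doc.metadata.get("allowed_roles", [])
    return get_user_role(user_id) in allowed

def retrieve_for_user(user_id: str, query: str, k: int = 5):
    candidates = vectordb.similarity_search(query, k=k * 2)   # over-fetch, then filter
    permitted = [d for d in candidates if user_can_read(user_id, d)]
    return permitted[:k]    # only permitted documents become prompt context
```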

Model flexibility and security

The BMC HelixGPT reference architecture is built on the model abstraction layer that LangChain provides. This enables seamless integration of general-purpose foundation models, whether hosted behind APIs such as OpenAI and Azure or running as open-source models in customers’ own data centers. LangChain offers more than 50 connectors to different model providers, making it easy to add new providers or models modularly. Customers who prioritize data security have the option to host and run a foundation model in their own datacenter. This architecture caters to diverse enterprise customers and prevents vendor lock-in, including implementations that provide the strongest privacy and security guarantees.
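
A minimal sketch of what this abstraction enables is shown below, assuming LangChain’s provider integrations; the model names, the Azure deployment name, and the self-hosted endpoint URL are illustrative only.

```python
# Sketch of "bring your own model": the chain and prompt code stay the same
# while the LLM object is swapped per deployment. Class names come from
# LangChain's provider integrations; names and URLs are illustrative.
from langchain.chat_models import ChatOpenAI, AzureChatOpenAI

def build_llm(provider: str):
    if provider == "openai":
        return ChatOpenAI(model_name="gpt-4", temperature=0)
    if provider == "azure":
        # Endpoint and API version come from the usual Azure OpenAI env vars.
        return AzureChatOpenAI(deployment_name="gpt-4", temperature=0)
    # Self-hosted open-source model exposed through an OpenAI-compatible API.
    return ChatOpenAI(model_name="llama-2-13b-chat",
                      openai_api_base="http://llm.internal:8000/v1")

llm = build_llm("openai")   # downstream chains accept any of these objects unchanged
```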

An Introduction to LLMOps

Machine learning operations (MLOps) applied to LLMs is called LLMOps. LLMOps is a new set of tools and best practices for managing the lifecycle of LLM-powered applications, including data management, model management, LLM training, monitoring, and governance. LLMOps is the driving force behind building generative AI applications for BMC HelixGPT.

BMC HelixGPT is a platform that provides models and services that allow applications to harness the power of generative AI. It also provides LLMOps foundational services such as prompt engineering and RAG to power a spectrum of use cases ranging from summarization to content generation and conversations.

LLMOps is distinct from MLOps because it introduces three new paradigms for adapting and training LLMs:

  • Prompt engineering
  • Retrieval augmented generation
  • Fine-tuning pipelines

My third and final installment in this blog series will dive deeper into BMC HelixGPT’s LLMOps capabilities.

Requirements for Building an Enterprise Generative AI Strategy
https://www.bmc.com/blogs/requirements-for-building-an-enterprise-generative-ai-strategy/ | Mon, 11 Dec 2023

While ChatGPT and GPT-3.5 ushered in a wave of innovations last year, the team behind BMC Helix had already been hard at work for the past few years exploring ways to adapt generative artificial intelligence (AI) technology to enhance enterprise service management applications, improve natural language conversations, emulate human language in chatbots, contextualize knowledge search, and make enterprise service management recommendations for case resolution. The team built several large language model (LLM) prototypes designed to be interoperable with the entire BMC Helix for ServiceOps platform. Our approach to generative AI was purposeful, focusing on the needs of our enterprise customers, and then delivering new use cases that would leverage the technology to resolve problems faster and with greater accuracy.

The result is BMC HelixGPT, a pre-trained generative AI LLM service that integrates into BMC Helix applications, learning from your enterprise’s knowledge (including user profiles and permission models) to deliver a tunable, prompt-driven conversational user experience. As 2023 draws to a close, we have not only built a scalable generative AI foundation with BMC HelixGPT, but we have also released five new HelixGPT-Powered capabilities:

  1. BMC HelixGPT-Powered Helix Virtual Agent
  2. BMC HelixGPT-Powered conversations in BMC Helix Digital Workplace
  3. BMC HelixGPT-Powered live chat summarization
  4. BMC HelixGPT-Powered resolution insights
  5. BMC HelixGPT GenAI LLMOps stack, a proprietary generative AI app builder tool (for advanced users)

This blog will be the first in a series that describes our journey in building BMC HelixGPT from end to end and shares key best practices for building a generative AI application with a powerful foundational LLM platform to power it all. If you are building generative AI apps or models, this blog series is for you. We will divide our journey into three parts:

Part 1 (this blog) will focus on unraveling the needs and expectations of a generative AI solution.

Part 2 will outline the components of the BMC HelixGPT platform reference architecture such as those from LangChain, which provide a framework to interact with LLMs, external data sources, prompts, and user interfaces.

Part 3 will show how BMC Helix for ServiceOps leverages BMC HelixGPT’s LLMOps capabilities to power new enterprise service and operations management use cases with generative AI.

Getting started with generative AI

ChatGPT demonstrated to the world the possibilities of generative AI. What impressed people the most was its ability to quickly provide answers on a vast array of topics in clear, understandable language. Enterprises soon demanded a more tailored approach to the technology, with answers that would be more specific to their internal knowledge and data versus the “world knowledge” that early models were being trained on. To articulate the strategic generative AI direction for BMC Helix, we adopted a systematic three-step process that is universally applicable for enterprises considering generative AI product use cases:

  1. Prioritize use cases based on business priorities.
  2. Build proofs of concept and get early customer feedback.
  3. Understand customer expectations.

Prioritize your enterprise generative AI use cases

One of the initial steps an enterprise must take is to prioritize use cases that align with business goals and priorities based on data availability, customer impact, team skills, and business considerations. In the enterprise service management space, we started with three key use cases that were most impactful for our customers:

  • Virtual agent and knowledge search
  • Resolution insights
  • Summarization

Build proofs of concept and get user feedback early

Once use cases are identified, enterprises need to build proofs of concept to validate them with generative AI. We built customer proofs of concept based on customer data for each of our top three use cases and gathered feedback through a design partnership program with our customers. While one team used retrieval augmented generation (RAG)-based approaches and showcased them to customers with real customer queries, another team built fine-tuned models for multiple use cases. Early prototypes in resolution insights, generative search, and chatbots were highly impressive and gave us the opportunity to understand and appreciate both the remarkable power and the limitations of LLMs.

Understand enterprise needs and expectations for generative AI

After we talked to multiple customers, their expectations of a generative AI solution became clear.

Specificity, accuracy, and trust

Enterprises want answers specific to their data, not generic answers derived from broad world knowledge. Take, for example, a question about resolving a VPN issue, such as “How do I fix a VPN connection issue?” Generative AI should generate an answer based on the VPN articles inside that enterprise, not a generic, plausible-looking answer produced by a broad model. Enterprises want factual, truthful answers without the hallucinations that are commonplace in general-purpose models. They also want the ability to verify answers with citations or references to build trust in how generative AI sources its answers.

Data security, privacy, and access control

Enterprises have varied user access controls governing who can access different levels of data, so answers from generative AI solutions need to adhere to those same access control policies. For example, a manager and an employee should get different HR answers to the same question because a manager has access to a larger set of documents. A few enterprises were also concerned about preserving the privacy of their data and ensuring that it would not be used to train a public model.

Real-time data ingest

Since enterprise data is constantly changing in real time, answers must also be based on the most up-to-date knowledge available inside the company.

Avoid vendor lock-in of generative AI modeling

Finally, we heard that many enterprises wanted the flexibility to choose their own models, whether commercial offerings such as Azure OpenAI and OpenAI or open-source alternatives.

Our early prototypes and high-level requirements collectively shaped the foundational thinking behind what enterprise customers expect from a generative AI solution. In the next blog, I will explain how we addressed customer expectations in our BMC HelixGPT generative AI reference architecture. Stay tuned. In the meantime, you can learn more about BMC HelixGPT here.

Detecting Major Incidents using Automated Intelligence and Machine Learning
https://www.bmc.com/blogs/detecting-major-incidents-using-ai-ml/ | Thu, 22 Jul 2021

It’s a typical Monday morning and incidents are streaming in. As a service desk manager, you notice more incidents than usual—“something just does not look right for Skype business services.” You start chatting with a few service desk agents to understand their incidents and determine if there is a pattern among them, but while doing so more incidents pile up, impacting customer satisfaction and delaying resolution. You need quicker answers to these two questions:

  1. Is there a major incident brewing for any business service right now?
  2. How many duplicate incidents are being created and worked on by different service desk agents?

Early detection of major incidents is critical for achieving higher customer satisfaction while improving the efficiency of the service desk. Artificial intelligence and machine learning (AI/ML)-driven clustering can help address these challenges effectively.

Natural language processing (NLP) can be used to understand the meaning of each incident, and similarity-based ML algorithms can then continuously group streaming incidents into meaningful, evolving clusters that are correlated based on time, text, and business service.

Challenge #1: Is there a major incident brewing for any business service right now?

A major incident is defined as a critical and urgent issue that has widespread organizational impact and affects multiple users or regions. It is usually associated with an outage of a business service and can cause financial impact to the company.

The typical scenario is that service desk agents start to see a flood of critical incidents on a specific service. These could be a mix of both user-generated and infrastructure-triggered incidents. Service desk managers rely on “word of mouth” to detect major incidents by calling agents or having group chat sessions—ad-hoc, inconsistent, and non-repeatable solutions that can delay the detection of major incidents. Service desk managers need a better way.

Challenge #2: How many duplicate incidents are being created and worked on by different service desk agents?

Not every incident storm is a major incident. Duplicate incidents may reflect a very localized issue. For example, if I have five incidents related to Salesforce application file downloads, a traditional incident management system would have five agents working on these independently. This can cause huge inefficiencies, especially if all five incidents relate to the same underlying cause. Detecting duplicate incidents so they can be managed efficiently is critical to an organized service desk operation.

A better way

Let’s see how we can address these two challenges by using an AI/ML-driven clustering workflow:

1. Match and maintain cluster lifecycle. As new incidents are created, they are matched against existing incidents using common ML-based similarity algorithms to determine the degree of similarity. For text, pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) or statistical techniques such as TF-IDF (term frequency-inverse document frequency) can be used. For example, assume we get three new incidents within a few minutes, each with slightly different text but the same intent: “I cannot login to Skype,” “My Skype fails to login,” and “Skype login doesn’t work.” NLP and language-model-based algorithms will detect that all three incidents have the same meaning (“intent”) and group them into a single cluster. (A minimal sketch of this grouping appears after step 3 below.)

Once a match is determined with high confidence, a new cluster of incidents is formed and tracked by the system. As new incidents flow in, they are compared to existing clusters to determine which one they belong to. Clusters can evolve and grow as more incidents are matched and added.

Incident clusters are interesting and useful only for a short period of time, so the cluster is closed after a set time period (e.g., 30 days) if there are no additions. Incident clusters can also be closed if all incidents in that cluster are resolved. Automatically closing clusters is an important capability so that only the top “emerging” and “fresh” situations are presented to the service desk manager. The lifecycle of a cluster is managed by the system for incident creations, closures, and incident updates.

2. Detect major incidents. Major incidents are detected by identifying fast-growing clusters of incidents, as well as those that have high criticality. Multiple factors should be considered, such as: incident count; average priority/criticality of the cluster; the importance of the business service that the cluster impacts; the region(s) where it is happening; and so on. Notifications based on these criteria, as well as drill-down visualization of these incident clusters, inform whether a major incident should be raised.

In our Skype example, if there were only three incidents in the cluster, it is likely not a major incident. However, if you get 30 incidents on Skype within 10 to 20 minutes, all low priority or a few high priority, then this cluster needs to be identified as a major incident based on predefined thresholds. The ability to view the cluster’s aggregate-level properties (incident count, average, and trending of count), as well as average priority, region, and service impacted are important considerations in deciding whether or not this cluster is a major incident. Visualization tools and automated rules engines that can use customer-specified threshold criteria to indicate and notify major incident candidates would help speed major incident detection.

3. Recommend and manage duplicates. Modern IT service management (ITSM) systems can manage and prevent duplicated work by manually creating parent-child hierarchies so that child incidents do not need separate agents; only one agent is assigned to the parent incident. This greatly improves the efficiency of a service desk.

Until now, detecting and managing duplicate incidents has been largely a manual, inefficient process, which increases the workload of each agent without adding value. AI/ML clustering technologies can present recommendations on parent-child relationships to streamline service desk operations. By recommending parent/child tickets, instead of assigning one agent to each child, you can assign one agent to the parent. This can eliminate a two-to-five-fold amount of duplicate work and hence improve efficiency.
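
To make step 1 concrete, here is a minimal sketch of text-similarity grouping using TF-IDF and cosine similarity from scikit-learn. The incident texts, threshold, and greedy clustering loop are illustrative only; a production system like the one described here would use richer language models such as BERT and manage cluster lifecycle over time.

```python
# Sketch: group incident descriptions that carry the same intent.
# TF-IDF + cosine similarity is used for brevity; a production system
# might use transformer embeddings such as BERT instead.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

incidents = ["I cannot login to Skype",
             "My Skype fails to login",
             "Skype login doesn't work",
             "Salesforce file download is slow"]

vectors = TfidfVectorizer().fit_transform(incidents)
sim = cosine_similarity(vectors)

THRESHOLD = 0.25          # illustrative; tuned per data set in practice
clusters = []             # each cluster is a list of incident indexes
for i in range(len(incidents)):
    for cluster in clusters:
        # join the first cluster containing a sufficiently similar incident
        if max(sim[i, j] for j in cluster) >= THRESHOLD:
            cluster.append(i)
            break
    else:
        clusters.append([i])

print(clusters)   # e.g. [[0, 1, 2], [3]]: the three Skype incidents group together
```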

Conclusion

You need a major incident warning system to avoid extended outages and impacts. BMC Helix ITSM has recently released these ML-powered capabilities as a part of the ITSM Insights offering to help you achieve that.

Container Security Best Practices
https://www.bmc.com/blogs/container-security-best-practices/ | Fri, 26 Apr 2019

In this Run and Reinvent podcast, I’m joined by Maya Kaczorowski, a product manager at Google, to discuss container security. Below is a condensed transcript of the discussion.

Ajoy Kumar: I know a lot of folks know what a container is, but how would you describe it to someone who is technical but not that familiar with containers?

Maya Kaczorowski: If somebody knows what a container is, you know it’s taking your application and packaging it with the libraries and binaries that it has as dependencies. That makes it easier for you to move it around your infrastructure and deploy it on other pieces of infrastructure that you have. You can develop once and deploy multiple times across the infrastructure. From a security point of view, it’s just about having more consistency in that environment.

Your DevOps team will like it because they don’t have to do, hopefully, as much work to deploy the same thing in many places. But from a security point of view, it means you can roll out the same thing in many places, which means fewer security reviews, fewer things that you have to worry about patching, and a more homogeneous, consistent environment.

Ajoy: Do you think containers are more secure or less secure than VMs and some of the other common things in your infrastructure? How would you compare container security with VM security, for example?

Maya: I think they have the potential to be more secure. So, there’s only a couple of things that really change when it comes to true underlying security between containers and VMs. Containers don’t historically have a strong isolation boundary. So, containers don’t actually contain, despite the name. So, they’re not meant for untrusted workloads. That being said, there are new projects like gVisor, Kata Containers, and Nabla Containers that are specifically meant to provide better isolation for containers. So, that’s kind of going away and changing. Some of the other things that are different are just that the industry isn’t there yet. Just like how you’d want to have an IDS or an IPS for a networking tool, or an identity management tool or whatever for your VMs, you’re going to want to have the same thing for your containers. But those don’t all exist yet. There still isn’t a ton of choice in the market for users who are looking for those solutions for containers specifically.

Ajoy: What is gVisor?

Maya: gVisor is a sandboxing technology that Google developed and open-sourced. It’s a sandboxed container runtime that lets you run untrusted workloads on the same host as a more trusted workload. So, the idea being if you’re running a multitenant environment, you don’t necessarily want somebody who’s running something potentially malicious for that process to escape and affect other things that are happening on the same node. Like I said, traditional containers aren’t sandboxes and are not meant to be sandboxes, so gVisor is supposed to help prevent that and provide a layer of sandbox security behind it. It’s effectively emulating a kernel that each individual container would talk to, but in guest mode. So, you actually have some virtualization-based isolation between what’s happening in your containers.

Ajoy: What tools should one be thinking about when it comes to container security?

Maya: How we’re thinking about container security at Google is trying to think about it comprehensively. So, think about the kinds of problems that you’re going to have in your environment, and what kinds of threats you’re going to be seeing. For example, one set of threats is things that people are trying to do to your API server; if you’re running an orchestrator like Kubernetes, people trying to get Kubernetes to do something that’s bad. This is about protecting your network, your secrets, your identities, all of that kind of stuff. And there, it’s not so much about tooling as about following best practices: putting in place what Kubernetes suggests, having good secure defaults, and templating things that make sense from the get-go.

The second area that we think about in terms of container security is what we’re calling the software supply chain. So, how you make sure that the images that you deploy, the container images that you deploy, don’t have any issues, don’t have any known vulnerabilities, meet your own internal requirements for compliance, are signed off on by your testing team, whatever it happens to be. And in that area, you probably need some new technology around how you verify those images, how you store your images, etc., before you deploy them into your environment.

And the last area that we’re thinking about is what we’re calling runtime security: things that happen once you have your containers up and running in production. So, somebody trying to DoS your containers, flood your event pipeline, or, as I was saying, people are quite worried about container escape. What happens if you have a workload that tries to leave your container and affect somebody else’s container? And in that case, there’s also that traditional IDS/IPS mindset applied to containers. How do we build the tooling and make something that works in that environment?

Ajoy: How do developers figure out if they have zero-day vulnerabilities in their software, or in some sense, any vulnerabilities?

Maya: Common vulnerabilities are documented in the CVE database, the common vulnerabilities and exposures database for security. You have to think about vulnerabilities here at so many different levels, right? So, I’m mostly focusing on container images, but if you think about security as the whole stack, you have to worry about any vulnerabilities in your hardware; any vulnerabilities in your underlying OS and boot process; the actual image that you’re going to deploy on your hosts in the container world; the image that you deploy within the containers; any issues with your virtualization software; any issues with your container runtime, like Docker or something like that; any issues with the container orchestration, like Kubernetes.

You have so many, so many layers of dependencies and different things you need to protect, and all of those could have vulnerabilities. So, for a lot of these, the best practice is to just patch and make sure you’re on the latest version of the system. In some cases, when you’re building your own applications and you have your container images, the best practice there is to scan those images for known vulnerabilities. And there are a bunch of tools that let you do that, both open-source tools and vendor tools, to verify whether your image contains any known vulnerabilities.

Ajoy: How does operations really take on Kubernetes security, container security? How should they start thinking about it?

Maya: I think a lot of the time what I see happen is somebody in DevOps somewhere in your company will say, “Hey, I want to start using Kubernetes,” and will start doing it. Or maybe in some cases the developers didn’t decide, but their boss’ boss’ boss decided that they were going to start using Kubernetes, and now they have to use Kubernetes. And this trickles down to the security team, and it trickles down to the operations team and all of that. From a security point of view, like I said, as long as you’re aware that it’s going on, there are actually quite a few documented best-practice guides, and if you follow them you’re in a pretty good spot.

From the Ops point of view, what changes is your development and your deployment velocity. The whole point of moving to containers is that you can more quickly deploy new things to your environment. So, you’re dealing with redeploying entire container images rather than deploying code changes to your codebase. So, how you manage that code check-in process, that code review process, that deployment process, all of that might be a little bit different than what you were doing already today. The benefit that you have as an Ops person is that you now have a single checkpoint, that deployment mechanism, and you can enforce certain requirements for the code that ends up in your environment.

So, a security checkpoint might verify, for example: did I build my code? Did I scan my code for vulnerabilities? Etc. But you might also have operations checkpoints you want, like: was my code properly tested? Have I tested it on this particular bespoke piece of hardware that I am particularly worried about? Is it valid to be deployed in this particular geography? These are some of the things that containers enable you, as an Ops person, to enforce in your environment.

BMC Cloud Operations Uses TrueSight Cloud Security
https://www.bmc.com/blogs/bmc-cloud-operations-uses-truesight-cloud-security/ | Tue, 14 Aug 2018

Yes, we eat our own cooking.

Have you ever wondered how BMC Software keeps its cloud environments safe and secure? One of the proudest and most thrilling moments for our Cloud Engineering team was using TrueSight Cloud Security, BMC’s very own automated cloud security and compliance solution, to achieve 100 percent compliance across our multiple cloud environments. Seeing that dashboard transform from red to green in such a short time was quite an achievement. In this blog, we describe how we run development environment security with TrueSight Cloud Security (TSCS). With BMC as our own first customer, we get direct feedback that helps us continuously deliver an ever-improving solution.

The Challenge – You cannot manage what you don’t measure

We have thousands of cloud resources changing every day in our development cloud accounts as developers continuously push new functionality to production. S3 buckets, firewall security groups, IAM roles, EC2 instances, and more are created or updated with daily pushes through DevOps pipelines. After each release, the big question on everybody’s mind was, “Are we secure? Did we mistakenly open a port to the internet?” Yes, just like your dev teams, we are constantly striving to increase our agile velocity; yet we do not want to compromise on security. The old business axiom, “You cannot manage what you do not measure,” is especially true for DevSecOps and cloud security. So, our first step was to benchmark our security posture, look for ways to fix high-risk vulnerabilities, and burn down our security backlog to secure our cloud resources… and keep them that way!

The First Security Scan

It took us less than an hour to assess our security posture, much to the delight of the team and exactly as we had envisioned when we set out to build our cloud security service. We pointed the TSCS cloud service at our own AWS cloud accounts and started scanning them for CIS compliance with over 50 controls for AWS. Within a few minutes, we finished the setup, and the initial security posture data started lighting up on the dashboard. RED! To our surprise, about 15 percent of our resources were noncompliant with the CIS security benchmarks.

Figure 1: Example dashboard in TrueSight Cloud Security showing security at a glance

Thankfully, no S3 buckets were open to the public; so many highly publicized data breaches have been caused by publicly accessible S3 buckets. With automation built in by design, TrueSight Cloud Security can easily find and fix such open public buckets in minutes. We were, however, alerted that 50 S3 buckets did NOT have encryption turned on. Ouch. Encryption at rest is a best practice for data security that we needed to resolve. We also found 70 IAM policy rule violations, such as lack of MFA, lack of key rotation, and more. Key compromise is one of the ways that can lead to data or account breaches, and we knew this one also would need a quick resolution. Once we knew our complete security posture in our development accounts, the team started triaging, assessing risk, and putting together a plan for remediation. PRO TIP: Assess posture, triage results, and build a remediation plan, much like you would groom a Scrum backlog.
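
For readers who want to see what this class of check looks like outside the product, here is a hedged boto3 sketch that flags S3 buckets without default encryption. It is illustrative only; TrueSight Cloud Security performs these checks and remediations through its own policy engine, not a script like this.

```python
# Sketch: flag S3 buckets without default encryption (illustrative, not the
# TrueSight Cloud Security implementation). Requires AWS credentials with
# s3:ListAllMyBuckets and s3:GetEncryptionConfiguration permissions.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
unencrypted = []
for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        s3.get_bucket_encryption(Bucket=name)          # raises if no default encryption
    except ClientError as err:
        if err.response["Error"]["Code"] == "ServerSideEncryptionConfigurationNotFoundError":
            unencrypted.append(name)

print(f"{len(unencrypted)} buckets lack default encryption: {unencrypted}")
```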

Instant Remediations

First things first: the 70 critical IAM security violations were easy to fix, and the team remediated most of them with a single click from TSCS. Cloud Operations was confident these fixes would not impact applications and quickly remediated them. Nice going. As the engineering team kept deepening the remediation action content, we soon realized the power of instant remediation with a single click from our UI. On a few other security issues, such as S3 bucket and EBS web tier encryption, the team created Jira tickets to track findings from TSCS and assigned them to developers. As our infrastructure is immutable, these issues were resolved through DevOps pipeline updates to the infrastructure. Within hours and days, we dramatically improved our security posture, with only 5 percent of noncompliant rules remaining. Measuring security posture and reporting it in our dashboards and PDF reports motivated the team to continuously improve. PRO TIP: Gamification of security posture across multiple accounts and teams clearly leads to higher security, as nobody wants to be at the bottom of the security grade.

Exceptions

The process of managing security issues can at first seem daunting, but TSCS helps manage the volume of these issues very effectively. The powerful search capability in the UI breaks down security posture by application, service, owner, or account tag, context that helps prioritize the most critical application and infrastructure issues for resolution first. Many issues can be remediated quickly through actions from the UI or through the incident integration process. There are always exceptions. After discussing with Security, a few findings were identified as acceptable low risks and were moved to the bottom of the backlog. Security is a risk management process where the highest-risk issues need to be fixed first, while low-risk issues can be deferred and put on an exception list. This is where the last of our remaining 5 percent of issues ended up. We used TSCS to mark the last of our findings as exceptions to reduce noise and “alert fatigue” while the Dev team added them to their technical debt. PRO TIP: Some security violations should not be resolved by Cloud Ops, but by Development. Use RBAC to provide remediation privileges to those who own the code. Use the UI to filter security findings by app, tag, account, and so on.

Multiple Accounts

As we completed securing our first batch of accounts, we quickly realized that multi-account management is a challenge. Leveraging the multi-account connector within TSCS, we created trust relationships to child accounts, streamlining multi-account security management. We then began collecting security data for multiple accounts. Teams now filter security findings by account and environment, as well as visualize the aggregate security posture across accounts. Many enterprises, including BMC, have hundreds or even thousands of public cloud accounts, where this account consolidation capability enables security at scale and simplifies security operations. PRO TIP: Begin with the end in mind. Start small, securing a handful of accounts, and be prepared to use account consolidation to simplify your cloud security methods.

We are the Watchers on the Wall

Getting the first scans and securing our accounts is only the first step. Are we done? Of course not. Our next step was to ensure that the high level of security is continuously maintained.

  • Daily reports automatically notify us of new security violations.
  • We are also working on a self-driving remediation feature, to automatically fix certain vital security issues (which we define), thereby reducing the mean time to resolve.

“We are the watchers on the wall… Night gathers, and so my watch begins… For this night, and all the nights to come.” — #GameOfThrones, from Oath of the Night’s Watch

Take the Test Drive

Would you like to see what TrueSight Cloud Security can do for you? Take the free 14-day trial. Connect the service to your account. Kick the tires. See how it drives. Then, if you’re interested, contact Sales.

Five Best Practices for Building Security as Code into a Continuous Delivery Pipeline
https://www.bmc.com/blogs/five-best-practices-building-security-code-continuous-delivery-pipeline/ | Wed, 24 Aug 2016

This is the fourth blog in our mini-series that illustrates how BMC was able to use agile development, cloud services, an Infrastructure as Code approach, and new deployment technology to deliver a new cloud native product.

Be sure to also read: Getting Started with Cloud Native Applications, 9 Steps for Building Pipelines for Continuous Delivery and Deployment, and Infrastructure and How “Everything as Code” changes everything.


You need agility to meet the requirements of digital business, but it’s also critical to ensure that security and compliance are built into your apps from the start. That involves thinking about security and compliance as code, not as an afterthought or add-on. When security is not initially built into code, compliance becomes much more onerous. As a result, just as a release is nearing deployment readiness, it requires security vulnerability testing, penetration testing, threat modeling, and assessments. These activities are done using tools, security checklists, and documents. If critical or high vulnerabilities or security issues are found, the release is stopped until they are fixed, causing delays of weeks or even months.

The “Security-As-Code” principle addresses this challenge by codifying security practices as code and automating their delivery as part of a DevOps pipeline. This approach ensures that security and compliance are measured and mitigated very early, at the use case, design, development, and testing stages of application delivery. Moving these processes earlier in the DevOps pipeline is known as the “shift left” approach, which effectively allows you to manage security just like code by treating security attributes such as policies, tests, and results as you would lines of code: storing them in the same repository and progressing them through the same DevOps pipeline.

Five essential best practices for applying security and compliance as code

How do you deliver apps quickly while ensuring that they are secure and compliant? We’ve identified five key best practices using “Security-As-Code” and “Compliance-As-Code” based on our experience in using a cloud native application running on Amazon Web Services (AWS).

  1. Define and codify security policies
    Security is defined and codified for any cloud application at the beginning of a project and is kept in a source code repository. As an example, a few AWS cloud security policies are defined below:
  • Security groups/firewalls secured
  • Virtual Private Cloud (VPC), Elastic Load Balancing, and other requirements are enforced and secured
  • All Elastic Compute Cloud (EC2) instances and other resources must be in the VPC, and each resource is individually firewalled
  • Encrypt all AWS resources, as well as logs, and use Key Management Services

Security policies need to be automated. By pressing a button, you should be able to evaluate security policies for any application at any stage and environment. This is a major shift that happens when you follow the principle of security as code – everything about security is codified, versioned, and automated. Security artifacts are checked in as “code” in a repository and versioned.
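
As a simple illustration of the idea, the sketch below keeps a policy as data in the repository and evaluates it against resource attributes. The policy keys and resource fields are invented for the example and are not a BMC or AWS schema.

```python
# Sketch: a codified policy (normally checked into the repo as YAML/JSON)
# evaluated against resource attributes pulled from the cloud provider.
# The policy keys and resource fields below are illustrative only.
POLICY = {
    "s3_bucket":      {"encrypted": True, "public": False},
    "security_group": {"open_to_world": False},
}

def evaluate(resource_type: str, resource: dict) -> list:
    """Return the list of policy violations for one resource."""
    violations = []
    for attribute, required in POLICY.get(resource_type, {}).items():
        if resource.get(attribute) != required:
            violations.append(f"{resource['id']}: {attribute} must be {required}")
    return violations

# Example: evaluating a bucket discovered during a scan
print(evaluate("s3_bucket", {"id": "logs-bucket", "encrypted": False, "public": False}))
```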

Enterprises should build standardized security patterns to allow easy reuse of security across multiple organizations and applications. Using standardized security templates (also as “code”) will result in out-of-the-box security, instead of having each organization or application owner define the policies and automation by team. For example, if you’re building a 3-tier application, a standard cloud security pattern must be defined. Similarly, for a dev/test cloud, another cloud security pattern must be defined and standardized. Both hardened servers and hardened clouds are patterns that can yield security out of the box.

With AWS cloud, the five NIST AWS CloudFormation templates can be used to harden your networks, VPCs, and identity and access management roles. The cloud application can then be deployed on this secure cloud infrastructure, where these templates form the “hardened cloud.” For private cloud or on-premises servers, server hardening can be done by products such as BMC BladeLogic Server Automation. For public cloud and application hardening, we have a new cloud service being developed to define and validate public cloud services.

  2. Define security user stories and do a security assessment of the application and infrastructure architecture
    Security-related user stories must be defined in the agile process, just like any other feature stories. Examples include “input validation for cross-site scripting and SQL injection,” “SSL/TLS enabled for all communication,” “automated application security testing,” and so on. This process will ensure that security is not ignored. Security controls are identified based on the application security risk and threat models for the application and infrastructure architecture. These controls are implemented in the application code or through policies.
  3. Test and remediate security and compliance at application code check-in
    As frequent application changes are continuously deployed to production, the security testing and assessment is automated early in the DevOps pipeline. Security testing is automatically triggered as soon as code check-in happens for both application and infrastructure changes.

There are many security vulnerability scanning tools available from BMC and other vendors or as open source software that can help automate and speed the detection and remediation of vulnerabilities. If critical or high vulnerabilities are found, automation tools can create “defects/tickets” for resolution that will be assigned to the owner of the component.

For modern cloud native applications, security and compliance testing at code check-in involves evaluating infrastructure-as-code artifacts, such as AWS CloudFormation templates, against policies. A new BMC cloud service can be used to enable this automation and can be integrated into a pipeline tool such as Spinnaker.

  4. Test security policies to ensure they don’t violate requirements
    Security policies need to be tested for violations, such as “no S3 buckets should be publicly readable or writable” or “do not use open security groups with 0.0.0.0/all traffic allowed.” This testing needs to be automated as part of the application delivery pipeline using security and compliance tools (see the sketch after this list). For example, in our application delivery pipeline for a cloud native application, we automatically run security scans orchestrated from Spinnaker to ensure that every push to production goes through security scanning. For regulatory compliance policy checks, such as compliance with Center for Internet Security and Defense Information Systems Agency requirements on servers, databases, and networks, the BMC BladeLogic suite of products can be used effectively to detect as well as remediate compliance issues.
  5. Test and remediate security and compliance in production
    As an application moves from development to QA to test, the automated security and compliance testing continues to happen. However, the risk of finding issues at this point is much lower because most of the security automation is done much earlier in the DevOps pipeline. Tools such as BMC BladeLogic Server Automation and BMC Threat Director can be used for production security patching after scanning production servers for mutable infrastructure. This is particularly critical because more than 80% of attacks target known vulnerabilities and 79% of vulnerabilities have patches available on the day of disclosure. For applications deployed in public clouds such as AWS, security and compliance evaluations can be automated through Spinnaker and new BMC cloud services.
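
Below is a hedged sketch of one such automated policy test, flagging security groups that allow inbound traffic from 0.0.0.0/0 via boto3. In our pipeline this class of check is orchestrated from Spinnaker and BMC cloud services; the standalone script is for illustration only.

```python
# Sketch of one automated policy test: flag security groups that allow
# inbound traffic from 0.0.0.0/0. Illustrative only; not the pipeline's
# actual implementation.
import boto3

ec2 = boto3.client("ec2")
open_groups = []
for sg in ec2.describe_security_groups()["SecurityGroups"]:
    for rule in sg.get("IpPermissions", []):
        if any(r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", [])):
            open_groups.append(sg["GroupId"])
            break

if open_groups:
    raise SystemExit(f"Policy violation: open security groups {open_groups}")
```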

In modern cloud-native application stacks, components are never updated or patched in place; they are replaced whenever changes are needed in the application or infrastructure. Even in these cases, new containers or servers are first tested for security and compliance before being deployed to production.

Shift Left

Instead of approaching security as an add-on process at the end of the release, or as periodic security scanning of production environments, start with security at the beginning of an application release. Represent it as “code” and bake it in. By following these best practices, successful product companies can ensure security while maintaining the agility and speed of innovation.

Infrastructure and How “Everything as Code” changes everything
https://www.bmc.com/blogs/infrastructure-everything-code-changes-everything/ | Wed, 17 Aug 2016

This is the third blog in our mini-series that illustrates how BMC was able to use agile development, cloud services, an Infrastructure as Code approach, and new deployment technology to deliver a new cloud native product.

Be sure to also read: Getting Started with Cloud Native Applications, 9 Steps for Building Pipelines for Continuous Delivery and Deployment, and Five Best Practices for Building Security as Code into a Continuous Delivery Pipeline.


Remember how long it used to take to release software when infrastructure, security, compliance, and operations processes were done by independent teams separate from application development? Operations worked mostly in isolation, occasionally interacting with application development teams. The operations and development teams used their own, often different, tools and complex slow-moving processes, such as change boards, approvals, checklists and 100-page policy documents, along with specialized tribal knowledge. That is quickly changing as today’s DevOps teams embrace best practices, such as the concept of Everything as Code, to ensure agility while maintaining governance.

As organizations transform to deliver new digital services faster, the disciplines of infrastructure, security, compliance and operations must also evolve to meet the requirements for speed, agility, and governance. The idea behind the Everything as Code concept is that infrastructure, security, compliance and operations are all described and treated like application code such that they follow the same software development lifecycle practices.

We have been employing these principles in our development of a new SaaS product that runs on Amazon AWS. The following represents some of the lessons we have learned from others, refined by our own experience in developing, delivering, and implementing a new cloud native application.

Infrastructure as Code and the Impact on DevOps
Let’s look at some Infrastructure as Code best practices we’ve learned after operating several production cloud applications on Amazon Web Services (AWS) cloud. Infrastructure includes anything that is needed to run your application: servers, operating systems, installed software packages, application servers, firewalls, firewall rules, network paths, routers, configurations for these resources, and so on.

  1. Define and codify infrastructure
    Infrastructure is codified in a declarative specification, such as CloudFormation templates for the AWS cloud, Azure Resource Manager templates for the Azure cloud, Docker Compose files and Dockerfiles, Chef cookbooks, and BMC Cloud Lifecycle Management (CLM) blueprints for both public cloud and on-premises datacenters. These templates describe the cloud resources, their relationships, and their configurations. They are used to easily provision infrastructure and applications, since the templates represent the single source of truth. They are also version controlled, so changes to infrastructure and applications are tracked and made in a predictable, governed manner, often integrated with development tools. These benefits are the key reasons that Infrastructure as Code is being widely adopted.
  2. Source repository, peer review and test
    Next, Infrastructure as Code is kept in a version control system, such as Git, where it is versioned, under change control, tracked, peer reviewed and tested just like application software. This will increase traceability and visibility into changes, as well as provide collaborative means to manage the infrastructure with peer reviews.

Example: If operations wants to roll out a change to the production infrastructure, operations does not need to do it through a console directly in production, as traditionally done in IT datacenters. Instead, operations can create a pull request on “infrastructure as code” Git artifacts, with peer reviews conducted on the changes, and then the code is deployed to production. This review process ensures higher quality infrastructure changes, as multiple team members have visibility into the changes and can assess their impact. It also enables “testing” of the changes early in the cycle. Version-controlled infrastructure changes also allow easy rollback to a prior infrastructure version.

  3. Follow the DevOps pipeline process
    Infrastructure as Code templates go through the DevOps pipeline just like application code and get deployed to production. The DevOps pipeline provides infrastructure change delivery and governance, ensuring that changes are tested and deployed in a controlled manner before moving to production environments. At each stage of the DevOps pipeline, these templates are used to provision or update environments such as Dev, QA, and Production, rapidly creating and de-provisioning dynamic infrastructure (a minimal sketch follows this list). Example: In AWS clouds, operations will create pull requests on CloudFormation templates to make changes to configuration parameters or AWS resources. These changes flow through lower environments such as Dev and Test and are fully exercised. This provides higher confidence that changes will not adversely impact the application when they are promoted to the production environment.
  4. Support Immutability
    Finally, Infrastructure as Code also supports server and container immutability. Before immutable infrastructure paradigms, operations teams managed infrastructure manually, updating or patching software, adding software package dependencies, changing configurations, and so on. This resulted in inconsistent infrastructure not only across development, test, and production environments, but also within each of those environments. Inconsistent infrastructure makes troubleshooting difficult. It also means that the infrastructure is not easily scaled, updated, or automated to achieve operational efficiencies. With immutable infrastructure, operations engineers treat infrastructure as disposable. They don’t make changes to servers or containers directly in production. Instead, they create new server or container images through a full DevOps pipeline and then deploy them into production to replace the running servers or containers. This keeps infrastructure consistent across environments, which facilitates automation in DevOps, auto-scaling, and remediation.
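
As a minimal sketch of template-driven provisioning, the snippet below deploys a version-controlled CloudFormation template to an environment using boto3, roughly what a pipeline stage would do. The stack name, template path, and parameters are illustrative only.

```python
# Sketch: provision an environment from a version-controlled CloudFormation
# template, as a pipeline stage would. Names and paths are illustrative.
import boto3

def deploy(environment: str, template_path: str = "infra/app-stack.yaml"):
    cfn = boto3.client("cloudformation")
    with open(template_path) as f:
        body = f.read()
    cfn.create_stack(
        StackName=f"myapp-{environment}",
        TemplateBody=body,
        Parameters=[{"ParameterKey": "Environment", "ParameterValue": environment}],
        Capabilities=["CAPABILITY_NAMED_IAM"],   # only needed if the template creates IAM roles
    )
    cfn.get_waiter("stack_create_complete").wait(StackName=f"myapp-{environment}")

deploy("dev")   # the same template, unchanged, is later promoted to QA and production
```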

Why DevOps Should Embrace Infrastructure as Code Principles
By following these best practices, Infrastructure as Code can be successfully used for both cloud-native and on-premises application delivery. Developers can specify infrastructure as a part of their application code and manage it all in a single repository. This keeps the full application stack code, definition, testing and delivery logically connected, resulting in better agility, consistency, automation and autonomy for developers for full-stack provisioning and operations.

For the operations team, best practices for software development are also applied to infrastructure, which helps to drive improved automation and governance, stability and quality without negatively impacting agility. Improvements in stability and quality can be attributed to following a DevOps pipeline with versioning, early testing of infrastructure, peer reviews and collaborative process — just like code. Finally, there is traceability and the ability to easily answer questions, such as:

  • What is my current infrastructure?
  • Who made infrastructure changes in the past few days?
  • Can I roll back the latest configuration change made to my infrastructure?

We strongly believe that using Infrastructure as Code principles in managing application delivery can result in compelling advantages to both developers and operations by increasing agility while maintaining governance.

]]>
How to Build a CD Pipeline https://www.bmc.com/blogs/9-steps-building-pipelines-continuous-delivery-deployment/ Wed, 10 Aug 2016 01:01:39 +0000 http://www.bmc.com/blogs/?p=9626 This is the second blog in our mini-series that illustrates how BMC was able to use the Spinnaker continuous deployment platform to deliver a new cloud-native product that we push to production once a week. Be sure to also read: Getting Started with Cloud Native Applications,  Infrastructure and How “Everything as Code” changes everything, and Five […]]]>

This is the second blog in our mini-series that illustrates how BMC was able to use the Spinnaker continuous deployment platform to deliver a new cloud-native product that we push to production once a week.

Be sure to also read: Getting Started with Cloud Native Applications, Infrastructure and How “Everything as Code” changes everything, and Five Best Practices for Building Security as Code into a Continuous Delivery Pipeline.

Continuous Delivery and Deployment

We had a big dream for the cloud-native SaaS application that we were building – to deploy it like Google, Facebook, and Netflix – by pushing hundreds of production changes each week. We started with our vision of the product, our requirements for a high-velocity continuous deployment pipeline, and an awareness of the continuous integration/continuous deployment (CI/CD) tools in the marketplace.

We decided to start with the Spinnaker open source software tool, and within a short three-week period, a two-person team had built eight pipelines using the Spinnaker CD tool, one for each of our eight microservices, and we began actively pushing software to the Amazon AWS cloud on a weekly basis. This blog describes our DevOps CI/CD journey, key learnings, and best practices for achieving continuous delivery with Spinnaker.

Our requirements for a continuous deployment pipeline

Our overwhelming objective for CD was to create a repeatable, safe way to deliver software to production as quickly as possible while ensuring confidence and stability. We wanted to automate our pipelines, integrate with a wide array of test automation tools, and create dynamic environments on AWS cloud during various stages of deployment. We also wanted visibility into the status of all our pipelines and environments in development, QA and production. Finally, we wanted near zero-downtime capabilities in production, which could be accomplished using techniques such as canary, rolling, and Blue-Green deployment and rollback strategies.

Making the decision – Spinnaker vs. Jenkins

We had several decisions to make. We picked GitHub Enterprise for managing our code and used Jenkins for building software and continuous integration (CI).

For our Continuous Delivery (CD) pipeline, we debated between using Spinnaker and Jenkins. Jenkins can be used for a CD pipeline; however, managing hundreds of job chains becomes quite complex with Jenkins, and it also lacks advanced deployment and cloud capabilities.

After investigating Spinnaker, we found that it easily scales, supports application-centric advanced deployment strategies out of the box, has good pipeline visualizations, and supports management of multiple dynamic cloud environments. Based on our experience using Spinnaker over the past six months, we feel confident that it has significantly helped us to achieve our original goals for continuous deployment.

Best Practices for CD

Here’s an overview of best practices for CD based on our experiences using Spinnaker and running dozens of microservice pipelines for our cloud-native application.

  1. Plan for frequent updates for each microservice pipeline

An application should be designed as a set of many small microservices. Each microservice should have its own deployment pipeline so that it can be independently and frequently updated in production. Typical stages in a deployment pipeline are build, unit testing, and integration testing, which includes functional, API, performance, and security tests. Each of these stages can also include the creation of dynamic environments in the cloud, so that the stage can be provisioned, executed, and decommissioned as part of the pipeline. Spinnaker has a number of built-in cloud plug-ins, so creating and destroying environments in AWS, Google, and other clouds can be done easily.

  2. Declaratively specify application microservices as Infrastructure as Code

Each microservice is specified in terms of its application and infrastructure stack. This can include the application JAR file, Dockerfile, or server image such as an AMI, any related services such as AWS Lambda functions, plus all configurations, policies, and so on, all managed in an infrastructure-as-code manifest. This process ensures versioning, consistency, change auditing, and testing. The build artifacts produced (such as baked images, JARs, and AWS Lambda functions) are also passed consistently among all the stages of a pipeline, from test to production, eliminating the risks of varying environments or configurations. We used AWS CloudFormation templates as our infrastructure-as-code manifest and tested these through Spinnaker pipeline stages (a small validation sketch follows). Our third blog will describe this practice in more detail.
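As a minimal sketch (the stack name, template file, and change set name are hypothetical), a pipeline test stage might validate the manifest and preview its changes before anything is deployed:

  import boto3

  cfn = boto3.client("cloudformation")

  with open("template.yaml") as f:
      template_body = f.read()

  # Fail the pipeline stage early if the template is not syntactically valid.
  cfn.validate_template(TemplateBody=template_body)

  # Preview exactly what would change in the QA stack before promoting further.
  cfn.create_change_set(
      StackName="myapp-qa",
      ChangeSetName="pipeline-preview",
      TemplateBody=template_body,
      Capabilities=["CAPABILITY_IAM"],
  )
  cfn.get_waiter("change_set_create_complete").wait(
      StackName="myapp-qa", ChangeSetName="pipeline-preview"
  )
  for change in cfn.describe_change_set(
      StackName="myapp-qa", ChangeSetName="pipeline-preview"
  )["Changes"]:
      print(change["ResourceChange"]["Action"], change["ResourceChange"]["LogicalResourceId"])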

  3. Visualize pipelines and environments

Live visualization of the pipelines, the environments, and the versions or build numbers of the microservices running in those environments allows both developers and operations teams to share a common view and fix issues as soon as something “breaks” in the pipeline, such as failed QA tests or failed security tests.

  4. Early left-shift security

Security work should be done as early as possible in the lifecycle and must be automated as part of the DevOps CI/CD process. Using Spinnaker stages, we do penetration testing and security testing of our templates and environments before the code gets into production (a simple template-scanning sketch follows). Learn more about this in our fourth blog on how we incorporated security and compliance as part of our pipeline.
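As a simple, hypothetical sketch of such a check (the template file name is a placeholder, and real scanners go much further), a pipeline stage might flag security groups that are open to the world:

  import json
  import sys

  with open("template.json") as f:
      template = json.load(f)

  violations = []
  for name, resource in template.get("Resources", {}).items():
      if resource.get("Type") == "AWS::EC2::SecurityGroup":
          for rule in resource.get("Properties", {}).get("SecurityGroupIngress", []):
              if rule.get("CidrIp") == "0.0.0.0/0":
                  violations.append(f"{name}: ingress open to the world on port {rule.get('FromPort')}")

  if violations:
      print("\n".join(violations))
      sys.exit(1)  # a non-zero exit code fails the pipeline stage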

  5. Test automation with prioritization and early feedback

Automating tests is perhaps one of the most critical aspects of pipeline design. Without full automation and high-quality, high-coverage tests, the deployment pipeline will lead to production failures. The “doneness” criteria for each sprint require comprehensive test automation in a production-like environment. Of course, in reality time is limited, so you should prioritize automation around the critical capabilities and flows, and then re-evaluate the cost-benefit of additional automation (a trivial smoke-test sketch follows).
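As a minimal, hypothetical example of prioritizing the critical flow first (the base URL and endpoints are placeholders), a smoke test run by every pipeline execution against the freshly provisioned environment might look like this, executed with pytest:

  import requests

  BASE_URL = "https://qa.example.com"  # injected per dynamic environment in practice

  def test_service_is_healthy():
      assert requests.get(f"{BASE_URL}/health", timeout=10).status_code == 200

  def test_critical_flow_returns_results():
      response = requests.get(f"{BASE_URL}/api/v1/reports", timeout=30)
      assert response.status_code == 200
      assert response.json()  # the critical flow must return data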

  6. Use staging before production

The data and environment used for testing and automation should mirror production as closely as possible. Most pipelines have a “pre-production” stage before pushing software into production. We built multiple stages across many AWS regions with Spinnaker before pushing the software to production.

  7. Advanced Deployment Strategies

One of our key goals is to update software in production very frequently and without any downtime. This is an area where Spinnaker excels, easily managing multi-region deployments such as Blue-Green and canary deployments for server groups and clusters across AWS and other public IaaS clouds. Our cloud-native application was built on AWS PaaS services, like AWS Lambda and AWS Elastic Beanstalk, and Spinnaker leverages these services to provide similar functionality (a Blue-Green sketch follows).
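As a minimal sketch of the Blue-Green idea on Elastic Beanstalk (the environment names are hypothetical, and Spinnaker normally orchestrates this step), the cutover amounts to swapping CNAMEs once the new environment is healthy:

  import boto3

  eb = boto3.client("elasticbeanstalk")

  def environment_is_healthy(name: str) -> bool:
      env = eb.describe_environments(EnvironmentNames=[name])["Environments"][0]
      return env["Status"] == "Ready" and env["Health"] == "Green"

  blue, green = "myapp-blue", "myapp-green"  # green runs the new release

  if environment_is_healthy(green):
      # Swap URLs so live traffic moves to the new release; the old environment
      # stays around for a fast rollback by swapping back.
      eb.swap_environment_cnames(SourceEnvironmentName=blue, DestinationEnvironmentName=green)
  else:
      raise RuntimeError("New environment is not healthy; aborting cutover")

Rolling back is the same call with the environment names reversed.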

  8. Monitoring user experience and metrics

Be sure to continuously get feedback about the user experience, response times, performance of key business services, and technical metrics for all environments, from development and QA through production. Keeping an eye on these metrics and making them part of your pipeline stages will ensure that no degradation creeps in as you push out releases with increasing frequency (a small metrics query sketch follows).
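As a minimal sketch (the load balancer name and latency budget are hypothetical), a pipeline stage can query CloudWatch and halt the rollout if response times have regressed:

  from datetime import datetime, timedelta
  import boto3

  cloudwatch = boto3.client("cloudwatch")

  stats = cloudwatch.get_metric_statistics(
      Namespace="AWS/ELB",
      MetricName="Latency",
      Dimensions=[{"Name": "LoadBalancerName", "Value": "myapp-prod"}],
      StartTime=datetime.utcnow() - timedelta(minutes=30),
      EndTime=datetime.utcnow(),
      Period=300,
      Statistics=["Average"],
  )

  datapoints = stats["Datapoints"]
  if datapoints and max(dp["Average"] for dp in datapoints) > 0.5:  # 500 ms budget
      raise RuntimeError("Latency regression detected; halting further rollout")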

  9. Culture – Developers own it end-to-end

Culture plays a very important part in our agile development process. Just having the right tools and technology doesn’t cut it. We instilled a new set of developer responsibilities: owning not just the code, but also the automation, the pipeline, and production operations for each microservice. This mind shift is critical to successfully adopting an agile, continuous delivery and deployment process.

While Spinnaker OSS is off to an amazing start, BMC discovered several key gaps in operationalizing Spinnaker that we are currently addressing within our development team. If anyone is interested in talking about what we are doing to enhance Spinnaker, please contact spinnaker@bmc.com.

Stay tuned for our next blog about how treating Infrastructure as Code helped drive quality and consistency in our development pipeline.

]]>
Getting Started with Cloud Native Applications https://www.bmc.com/blogs/getting-started-cloud-native-applications/ Wed, 03 Aug 2016 01:01:18 +0000 http://www.bmc.com/blogs/?p=9616 This is the first blog in our mini-series that illustrates how BMC was able to use agile development, cloud services, an Infrastructure as Code approach, and new deployment technology to deliver a new cloud native product. Be sure to also read: 9 Steps for Building Pipelines for Continuous Delivery and Deployment, Infrastructure and How “Everything as Code” […]]]>

This is the first blog in our mini-series that illustrates how BMC was able to use agile development, cloud services, an Infrastructure as Code approach, and new deployment technology to deliver a new cloud native product.

Be sure to also read: 9 Steps for Building Pipelines for Continuous Delivery and Deployment, Infrastructure and How “Everything as Code” changes everything, and Five Best Practices for Building Security as Code into a Continuous Delivery Pipeline.


Gartner defines BiModal IT as an organizational model that segments IT services into two categories based on application requirements, maturity and criticality. Mode 1 is predictable and traditional, with an emphasis on exploiting what is known. It’s a reliable approach based on scalability, efficiency, safety, and accuracy. The challenge with Mode 1 apps is that the cycle times are long (i.e., months).

If you want agility and speed, then consider Mode 2 apps as part of your IT strategy. Gartner compares the Mode 1 style to that of a marathon runner. Mode 2, according to Gartner, is like a sprinter, where work is exploratory and is often tested in short iterations, such as days or weeks.

Putting Mode 2 into Practice

Our team recently built a Mode 2 cloud native application on an Amazon Web Services (AWS) cloud and this project got us thinking about the unique enablers that led to creating a successful Mode 2 cloud app. Mode 2 apps are critical for meeting the requirements of digital business because they enable DevOps teams to accelerate the pace of innovation. I’d like to share with you some best practices in building these applications and getting your product to market quickly.

  1. Act like a start-up and focus on the minimum viable product (MVP)

Startups usually have a solid vision and a focus on doing one thing very well. Software startups often focus on delivering the MVP quickly. This allows them to get feedback, iterate and develop better versions faster, rather than taking too long to develop what may or may not be the right product.
We acted like a startup charged with solving a customer problem. In this case, the problem was trying to develop a product that could help customers simplify assessments and compliance reporting for cloud operations. Keep in mind that Mode 2 cloud native applications require what can seem like a maniacal focus on the customer problem, market validations, and delivering value to the customer by zeroing in on one or two use cases. I’ve seen many Mode 1 projects that took 2 to 6 months just to define the product to build! We needed to be more agile. We knew exactly what to build and had an uncompromising faith in the product and what was required.

 

  2. Use the Cloud for the higher-level services of managing the infrastructure

Our team decided to go “all in” with the AWS cloud to get the heavy lifting done – the infrastructure automation, routing, message buses, load balancers, server monitoring, running clusters and servers, patching them, maintaining them, and so on. This “all-in” approach helped get the product to market faster by eliminating the need to deal with some of the following challenges:

  • Spending too much time on what seems like never-ending discussions related to determining the platform
  • Ensuring portability to multiple clouds or multiple data centers
  • Determining what should be on-premises or SaaS

 

  3. Use Microservices-based architecture

We heavily used the microservices and 12-factor application principles in architecting our application to increase speed and agility. We designed six separate microservices based on key business use cases and functions.
Each microservice was independently and autonomously designed and built by a one- or two-person engineering team. Each team had complete flexibility in picking not only the programming language, but also the data store. Of course, certain guardrails were clearly defined for all the microservices. For example, each microservice exposed a REST API with a declarative specification that was defined in Swagger and pushed to the AWS API Gateway (a tiny example follows). Each microservice chose its own database services if it needed persistence. In addition, each microservice followed Infrastructure-as-Code practices, which I’ll discuss in blog #3, where the full-stack definition was declaratively stored in AWS CloudFormation templates.
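As a minimal sketch of that guardrail (the API title and path are hypothetical), a microservice’s Swagger definition can be pushed to the AWS API Gateway programmatically:

  import json
  import boto3

  # A minimal Swagger (OpenAPI 2.0) definition for one hypothetical microservice.
  swagger_spec = {
      "swagger": "2.0",
      "info": {"title": "compliance-reports", "version": "1.0"},
      "paths": {
          "/reports": {
              "get": {"responses": {"200": {"description": "List compliance reports"}}}
          }
      },
  }

  apigw = boto3.client("apigateway", region_name="us-east-1")
  api = apigw.import_rest_api(body=json.dumps(swagger_spec))
  print("Created REST API", api["id"])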

Empower your team
Our fully autonomous team was charged with making its own decisions and shaping the destiny of our apps. We avoided, at all costs, endless comparisons and the kind of dialog that creates paralysis by analysis. After six months of using AWS for a variety of applications, we continue to be delighted and amazed by the power of the platform. While AWS manages the infrastructure, our attention is focused on addressing business problems that matter and building apps to solve them. Today, all of our conversations in the team are about customer use cases, user stories, and design patterns based on PaaS and serverless paradigms and DevOps. We don’t need to spend time managing infrastructure such as machines, networking, firewalls, clusters of database machines, or clusters for big data streams. We let AWS take care of these while we focus on delivering value to the customer.

Stay tuned for our next blog on our AWS project to learn about how treating Infrastructure as Code helped drive quality and consistency in our development pipeline.

]]>
How To Introduce Docker Containers in The Enterprise https://www.bmc.com/blogs/3-steps-to-introduce-docker-containers-in-enterprise/ Thu, 08 Oct 2015 17:08:42 +0000 http://www.bmc.com/blogs/?p=8809 Docker container technology has seen a rapid rise in early adoption and broad market acceptance. It is a technology that is seen to be a strategic enabler of business value because of the benefits it can provide in terms of: Reduced cost Reduced risk Increased speed For enterprises that haven’t worked with Docker, introducing it […]]]>

Docker container technology has seen a rapid rise in early adoption and broad market acceptance. It is a technology that is seen to be a strategic enabler of business value because of the benefits it can provide in terms of:

  • Reduced cost
  • Reduced risk
  • Increased speed

For enterprises that haven’t worked with Docker, introducing it can seem daunting. How do you achieve business value, run Docker in development, test, and production, or effectively use automation with Docker?

As experienced users of this transformative tool, we have had success with a three-step yellow brick road approach. This process will enable your enterprise to embark on the Docker journey too.

(This is part of our Docker Guide. Use the right-hand menu to navigate.)

Getting started with Docker containers

Step 1: Evaluation

In the early phases, engineers play with and evaluate Docker technology by dockerizing a small set of applications.

  1. First, you’ll need a Docker host. Ubuntu or Red Hat machines can be used to set up Docker in a few minutes by following the instructions on the Docker website.
  2. Once the Docker host is set up, at least initial development can be done in an insecure mode (there is no need for certificates in this phase). You can log in to the Docker host and use the Docker pull and run commands to run a few containers from the public Docker Hub.
  3. Finally, selecting the right applications to dockerize is extremely important. Stateless internal or non-production apps are a good starting point for conversion to containers. Conversion requires the developer to write Dockerfiles and become familiar with the Docker build commands as well. The output of the build is a Docker image. Usually, an internal private Docker registry is installed, or the public Docker Hub is used with a private account so your images do not become public (see the sketch after this list).
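As a minimal sketch of these evaluation steps (the image names, Dockerfile location, and registry are placeholders), the same pull, run, and build workflow can be scripted with the Docker SDK for Python instead of the CLI:

  import docker

  client = docker.from_env()  # talks to the local Docker host

  # Pull and run a container from the public Docker Hub.
  client.images.pull("hello-world:latest")
  output = client.containers.run("hello-world:latest", remove=True)
  print(output.decode())

  # Dockerize an application: build an image from a Dockerfile in ./myapp.
  image, build_logs = client.images.build(path="./myapp", tag="internal-registry.example.com/myapp:0.1")

  # Push the image to a private registry so it does not become public.
  # client.images.push("internal-registry.example.com/myapp:0.1")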

Step 2: Pilot

In the pilot phase, the primary goals are to start bringing in the IT and DevOps teams to work through the infrastructure and operations needed to set up Docker applications. An important part of this phase is to “IT-ize” the Docker containers and run a pilot in IT production so that the IT operations team can start managing them. This phase requires that IT operations manage dual stacks: the traditional virtualization platforms alongside the new Docker container infrastructure.

Management systems and software tools will be needed in four primary areas:

  1. Build Docker infrastructure. Carve out a new Docker infrastructure consisting of a farm of Docker hosts to run containers alongside traditional virtualization platforms and hybrid clouds.
  2. Define & deploy your app as a collection of containers. Management system software can provide blueprints to define an application topology consisting of Docker containers, spin them up, and then provide “Day 2” management of the containers for end users, such as start/stop and monitoring of Docker applications. It can also integrate with Docker Hub or Docker Trusted Registry for sourcing images.
  3. Build your delivery pipeline. DevOps products can offer CI/CD workflows for continuous integration and continuous deployment of Docker images.
  4. Vulnerability testing of containers. Server automation tools can be used to do SCAP vulnerability testing of Docker images.

Step 3: Production

Now, you can deploy Docker containers to your production infrastructure. This will require not just DevOps and deployment of containers to a set of Docker hosts, but also security, compliance, and monitoring.

Supporting complex application topologies is a degree of sophistication that many enterprises will, in fact, desire in order to:

  • Allow a gradual introduction to the benefits of containers
  • Keep the data in traditional virtual or physical machines

Another degree of sophistication is the introduction of more complex distributed orchestration to improve data center utilization and reduce operational placement costs.

While in the previous phase we used static partitioning of infrastructure resources into clusters, this phase will use more state-of-the-art cluster schedulers such as Kubernetes or Fleet.

Governance, change control, CMDB integration, and quota management are some of the ways an enterprise can start governing the usage of Docker as it grows. Reducing container sprawl through reclamation is another process that needs to be automated at this level.

Final thoughts

Evaluate the business benefits at the end of each of these steps to determine if you’ve achieved ROI and accomplished your goals.

We believe that using this three-step phased approach to introducing Docker, with increasingly sophisticated usage and automation, will make it easy to test-drive and productize Docker inside enterprises.


]]>