Retailers wanting to upgrade their monolithic systems to a microservices architecture will need to build a management platform to truly capture the benefits of the move. The advantages of microservices over monolithic architectures are evident: greater scalability, more customization, quicker updates, and easier testing. However, many organizations face problems with environment stability, a lack of dynamic scalability, and a loss of efficiency simply because they did not establish a proper microservices platform to support all the new services and changes.
For retailers, moving to microservices is especially important: many bought into large monolithic software packages back when e-commerce was exclusively on desktop, but now need the flexibility to meet the expectations of demanding omnichannel customers. Simply building services and attaching them to the existing system will result in management, testing, stability, and implementation issues. Therefore, retailers need to invest in a microservices management platform that can support the development, deployment, management, and production operations of microservices and their application components.
Ideally, this platform should be built during the initial replatforming and the cloud migration of the first few applications. Generally speaking, the benefits of building a microservices management platform justify the investment once there are three to five services, depending on the complexity of the services in question.
In this article, we will first discuss why a microservices platform is needed, how to build one, and how to manage microservices. Then, we will describe the approach of adopting a cloud infrastructure at scale in an enterprise environment, explain the best methods of setting up a microservices platform, and provide an overview on the processes of change management and continuous delivery.
Before we begin, we’d like to review the challenges that the release engineering, production operations, and infrastructure teams face when migrating to the cloud and microservices. The key difficulty is that such a migration dramatically increases the number of configuration points across multiple dimensions:
The traditional approach to release engineering and production operations is often not ready to deal with this increase in complexity. In our experience, this approach breaks down at several weak points:
To efficiently adopt cloud and microservices-based architecture and deal with these new challenges, the development and operations teams need better processes and tooling. But before we dive further into the details, here’s a quick overview of the key topics we will touch on throughout the rest of the article:
Although not directly related to microservices, setting up cloud infrastructure properly is very important at the start of the migration and replatforming process. A typical mistake made by companies with a traditional approach to managing infrastructure is onboarding a public cloud but closing access to the cloud APIs for everyone. Instead of providing access via API, the infrastructure team continues to grant cloud resources through a ticketing system or self-service UI portals, such as ServiceNow or JIRA. However, such a setup removes most of the benefits that the cloud brings to the organization. The opposite error is just as common: giving full access to the development and operations teams, which violates most companies' IT policies and may lead to significant security, quality, and overspending issues down the road.
Fortunately, there is a middle ground: the infrastructure team sets up the base cloud infrastructure, such as projects, workspaces, networking, and firewalls, defines the allowed VM images, including OS and middleware, and provides API access to the development and operations teams. This enables easy, on-demand provisioning of the cloud resources that are allowed. Major cloud providers, including AWS, Google Cloud, and Microsoft Azure, offer special mechanisms to enforce corporate policies while retaining access via API. A significant part of this mechanism is Identity and Access Management, or IAM (see the examples in AWS and GCP). Using the tools that cloud providers offer, an enterprise infrastructure team can configure policies related to cloud resource access, budgets, networking, firewalls, allowed base VM images, and more. Ultimately, access to the cloud resources is split between the teams in the following way:
We are not going to get into the specifics of how to set up policies on the various cloud providers, but feel free to reach out to us if you would like to learn more. It is important to note that the appropriate level of access may vary depending on the maturity of the development teams and on the strictness of the internal policies and external standards, such as PCI and HIPAA, that the company needs to comply with.
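That said, to give a flavor of what such a guardrail looks like in practice, here is a minimal sketch assuming AWS and the boto3 SDK; the policy name, account ID, image tag, and allowed instance types are all hypothetical, and GCP and Azure offer equivalent policy mechanisms:

```python
import json
import boto3  # AWS SDK for Python

iam = boto3.client("iam")

# Hypothetical guardrail: development teams may launch EC2 instances via the
# API, but only from approved, hardened AMIs and only of approved sizes.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowLaunchFromApprovedImagesOnly",
            "Effect": "Allow",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:us-east-1::image/ami-*",
            "Condition": {
                # Tag applied by the infrastructure team when it
                # publishes a hardened base image.
                "StringEquals": {"ec2:ResourceTag/approved": "true"}
            },
        },
        {
            "Sid": "DenyOversizedInstances",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:us-east-1:123456789012:instance/*",
            "Condition": {
                # Control spending by denying anything outside
                # the approved instance-type list.
                "StringNotEquals": {"ec2:InstanceType": ["t3.medium", "m5.large"]}
            },
        },
    ],
}

iam.create_policy(
    PolicyName="dev-team-ec2-guardrails",  # hypothetical name
    PolicyDocument=json.dumps(policy_document),
)
```

A policy like this preserves API-driven, on-demand provisioning for the development teams while keeping the infrastructure team in control of images and spend.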
A separate important topic in properly setting up cloud infrastructure is choosing the right regions to deploy applications in. We assume that new applications and services deployed in the cloud need to integrate with existing services running in on-premise datacenters, so network latency between the cloud and the datacenter is a real concern. Moreover, major cloud vendors provide a number of regions across the US and around the world. In many cases, existing datacenters are concentrated in a single geography, or applications are deployed in an active-passive DR setup, where only one datacenter is active at any point in time. Because nothing travels faster than the speed of light, even with the best networking between the cloud and private datacenters, a poor geographical distribution may introduce significant network latency overhead, as can be seen in this infographic:
When choosing a cloud provider, or regions within a provider, the network latency between the cloud and the datacenter needs to be tested. Latency also needs to be taken into account when choosing which applications and services should be migrated to the cloud first. As a side note, network latency overhead is one of the reasons why we don't recommend splitting stateless application components and their databases between the cloud and the datacenter: database communication protocols are typically chatty, and assume deployment co-located with the application instances.
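As a quick sanity check before committing to a region, round-trip latency can be measured directly. Below is a minimal sketch that times TCP connection setup from a datacenter host to a candidate cloud endpoint; the host name and port are placeholders:

```python
import socket
import statistics
import time

def measure_tcp_latency(host: str, port: int, samples: int = 20) -> float:
    """Return the median TCP connect time to host:port, in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass  # we only care about connection setup time
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

# Placeholder endpoint: a test service deployed in a candidate cloud region.
print(f"median RTT: {measure_tcp_latency('service.us-east-1.example.com', 443):.1f} ms")
```

Running a check like this from each datacenter against each candidate region gives a simple latency matrix to inform both the region choice and the migration order.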
A base cloud infrastructure is enough to start deploying applications. However, when designing microservices, or at least good service-oriented systems, there is a set of application-independent capabilities that are not provided by pure Infrastructure-as-a-Service. In many cases, we see that companies decide to use a Platform-as-a-Service (PaaS) to increase the efficiency of their application development and support teams. The reference architecture for such a platform can be found below:
In addition to application-independent microservices management capabilities, such PaaS offerings provide a number of other services including databases, caches, messaging, and application servers. Although this sounds advantageous, PaaS technologies may soon become too restrictive, and the promised benefits will be overshadowed by the high level of customization that each application team requires, and by the increased complexity and cost of operations. While we believe that PaaS technologies are great for early experimentation and applications with low complexity, based on our experience, enterprise-wide adoption should be done only after careful analysis.
Instead of using a full-fledged Platform-as-a-Service, we recommend adopting a limited PaaS, or a microservices platform. Such a platform becomes a thin layer between the infrastructure and the application, and leaves the choice of databases, caches, message queues, application services, and other components to the applications. Instead, the platform focuses on application-independent capabilities, as shown below:
Before we get to the actual definition of a microservices platform, we would like to clarify what we mean by “microservices”. Interestingly, there is still a debate in the industry over the precise definition of what they are, and how they are different from services in a service-oriented architecture. Fortunately, the precise definition is not required for the purposes of this article, so any typical definition of a microservice will work. We are just going to make several assumptions, with the general layout of a microservice shown here:
Strictly speaking, taking into account the definition and assumptions above, a microservice in this case can also be an ETL or UI component. A microservices platform consists of a number of foundational capabilities that enable the efficient development, building, deployment, and management of such microservices, as seen in this infographic, with further explanation below:
Some of the other features that may be a part of the microservices platform include authentication and authorization, API management, and service mesh. We will return to service mesh later in this article, and hope to cover the other two in future blog posts.
The platform can be implemented with various technologies, but in practice there are two distinct approaches, dictated by how the deployment units are packaged: VM images or Docker containers. The table below compares the technologies used in the two options:
Although the two options look very different, conceptually they are very similar. We have experience working with both, and found that everything in a container-based stack has a direct one-to-one mapping in a VM-based stack, and that it is relatively easy for developers proficient with one stack to switch to the other.
One feature of a microservices platform that deserves special discussion is the service mesh. A service mesh is needed for flexible traffic routing between microservices. In traditional production operations, there are two main options for upgrades: blue-green and rolling. While a rolling upgrade is easier to execute, it doesn't allow for controlled A/B testing or canary releases, so it is often used only for small changes. When services need to be upgraded with a canary release and A/B testing, a variant of the blue-green upgrade is typically used. Performing a blue-green upgrade is easy when targeting the top-level, customer-facing components and services. It gets significantly more difficult when services deep in the service mesh need to be upgraded with top-level canary releases or A/B testing, as shown here:
When a service mesh is properly implemented, a service or component deep in the stack can be upgraded, and requests sent to the top-level services with specific headers will be properly routed to the blue or green instances of that service. This allows top-level controlled experiments to be executed, including running end-to-end tests before routing customer traffic to the new version. It enables business users to explore the new version before it is made available to the general public, and a small group of customers to be selected to test the new functionality before a widespread roll-out.
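To illustrate the idea independently of any particular mesh technology, here is a toy sketch of the routing decision a sidecar proxy makes: requests carrying a specific header are pinned to the green pool at every hop, while all other traffic continues to hit blue. The header name, pools, and addresses are all illustrative:

```python
import random

# Illustrative instance pools for one service deep in the mesh.
POOLS = {
    "blue": ["10.0.1.11:8080", "10.0.1.12:8080"],  # current version
    "green": ["10.0.2.21:8080"],                   # canary version
}

ROUTING_HEADER = "x-canary"  # propagated unchanged through every service hop

def pick_upstream(headers: dict) -> str:
    """Decide which instance pool serves this request, as a sidecar would."""
    pool = "green" if headers.get(ROUTING_HEADER) == "true" else "blue"
    return random.choice(POOLS[pool])

# A top-level request tagged for the canary is routed to green at every hop,
# so end-to-end tests or a selected customer group exercise the new version
# while all other customers keep hitting the stable blue instances.
print(pick_upstream({"x-canary": "true"}))  # -> a green instance
print(pick_upstream({}))                    # -> a blue instance
```

In a real mesh, this decision is expressed as declarative routing rules rather than code, but the principle, consistent header-based routing at every hop, is the same.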
There are a number of technologies available today that help with implementing a service mesh. Some of the popular ones include Linkerd, Envoy, and Istio.
When designing new systems in the cloud, it makes sense to consider three major areas of changes:
While the first two types typically go through a regular continuous delivery pipeline, environment-level configuration changes are often not even considered to be changes. For example, modifications to endpoints, secrets, and the scalability configuration of stateless applications do not require human intervention: they are handled by the platform according to preconfigured policies. Changes to feature flags should be treated in the same way as changes in data, as the application business logic needs to be tested and ready to work with any allowed combination of feature flags.
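As a concrete illustration of that last point, here is a minimal hypothetical sketch: the business logic reads its feature flags at runtime, and a test sweeps every allowed combination of flags, so that flipping a flag in production never exercises an untested code path. The flag names and logic are made up for the example:

```python
from itertools import product

def checkout_total(cart_total: float, flags: dict) -> float:
    """Illustrative business logic guarded by feature flags."""
    total = cart_total
    if flags["loyalty_discount"]:
        total *= 0.95  # 5% discount for loyalty members
    if not flags["free_shipping"]:
        total += 4.99  # flat shipping fee while free shipping is off
    return round(total, 2)

# Test the logic against every allowed combination of feature flags,
# treating a flag flip like any other data change.
FLAG_NAMES = ["loyalty_discount", "free_shipping"]
for values in product([False, True], repeat=len(FLAG_NAMES)):
    flags = dict(zip(FLAG_NAMES, values))
    assert checkout_total(42.00, flags) > 0, flags
```

Once every combination is covered by tests, a feature flag can be flipped in production as safely as any other data update, with no new deployment required.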
Change management and continuous delivery for microservices is a separate, large topic. The format of this article doesn’t allow us to describe it in full detail, so we will just give an overview of the approach and some of its best practices:
In this article, we described how to build a microservices platform and design a proper change management process. By doing this, retailers can open the door to a massive migration to the cloud. A typical implementation delivers the following benefits:
Additionally, the establishment of a microservices platform enables further development in replatforming, making all future processes easier.
Retailers all have similar issues with their monolithic black-box platforms, which has pushed them towards migrating to microservices in the cloud. If this replatforming is not supported properly, the benefits of the move might not be realized. However, with the steps laid out here, the migration should go smoothly and efficiently. Building a microservices platform, along with an accompanying change management process, can bring substantial benefits, as the list of results above shows.
Of course, the topic of microservices and the infrastructure to support them is a broad one, and we can't discuss every element of the subject here. One aspect that we didn't describe in detail in this article is the role of production operations, or Site Reliability Engineering. The microservices platform already provides some advanced capabilities in this area, including auto-scaling and self-healing. We hope to return to this topic in greater depth in one of our future blog posts.
In the following articles in the series, we will continue our discussion on how to replatform the rest of the services in the digital customer experience platform. These will include posts on cart and checkout, customer profiles, product information management, customer data management, and one on pricing, offers, and promotions. We will also describe how to achieve a true omnichannel experience by deploying the platform in physical stores. Look out for the upcoming articles on this blog, and if this post has sparked your interest, give us a call or leave us a comment!