Guest post by Justin Turner, Director of Engineering at H-E-B
Reinventing ourselves when it mattered most
2020 was a challenging year for many of us, both personally and professionally. The COVID-19 pandemic dominated our daily activities and rapidly changed how our society operates. The retail grocery industry was hit particularly hard, with huge surges in demand for contactless interactions through curbside pickup and home delivery.
One of those retailers was H-E-B, a Texas and Mexico-based grocery chain with 420+ locations. While H-E-B’s curbside and home delivery offerings were already well used at the time, the sudden increase in demand forced a reinvention of its fulfillment systems to meet the need and prioritize the safety of customers.
In early 2021, the curbside and home delivery engineering teams, which I lead, completed the reinvention of our systems into multiple containerized services that are deployed to several GKE Kubernetes clusters and networked together with Linkerd. We undertook this effort to increase the speed of feature delivery to meet the ever-increasing demands of our business and significantly improve the resiliency of the system.
When we began this journey we faced a challenging reality: we were working with a single, monolithic codebase burdened with technical debt that made it hard for us to move at the speed we desired. As we explain in this blog, our journey was transformative. Despite the mountain ahead of us, our efforts resulted in multiple deployments a day, 99.99% reliability, and a change failure rate of less than 5%, all while delivering the high-impact features our H-E-B curbside business needs. Most importantly, we were able to accomplish all of this when it mattered most to our customers: during the pandemic.
Why cloud native services?
The pandemic acted as a catalyst: it forced us to reevaluate the priorities on our technical roadmap and increased the urgency to deliver new services rapidly and take the heat off our strained legacy monolith. It also drove the need for critical new functionality to handle the higher curbside usage and the inventory challenges we were experiencing. We wanted to deliver the new services quickly so that we weren’t moving the goalposts on ourselves by enhancing only the monolith: every feature added to the monolith alone would make parity between monolith and services harder to reach and push the timeline out significantly.
It’s important to note that fulfilling a customer’s curbside or delivery order quickly and accurately requires a symphony of events to take place in our stores and fulfillment systems in a very short time. Any disruption to this process is costly, since delays cascade from in-progress orders to future orders. And because our services have been a lifeline to many Texans during the pandemic, the impact to our customers is very real when incidents occur.
The stakes were high
Given this, the requirements and risks associated with rebuilding our monolith into cloud native microservices were significant. The services had to be low latency and highly resilient to the realities of a complex ecosystem, all while threading the needle so that we didn’t create a distributed monolith. We therefore had to be very deliberate about the technologies and resilience engineering patterns we chose to adopt. On the platform side, we decided to use Kubernetes and containerized workloads, and to apply domain-driven design to create reasonably sized services that were loosely coupled with each other. These decisions would be key to building a product that was resilient to failure, easy to develop on, and had room to grow, change, and be replaced as our curbside business needed.
In late 2019 we deployed the first of our new services to production and began building our operational competency. By late March 2020 we had connected our critical services, such as shopping and retrieval, in a lower environment and used Locust, a load testing framework, to introduce the anticipated traffic. However, when issues occurred we began to see how much complexity was involved and how significant an investment it would take to operate these services successfully. Several of these issues were related to network disruptions, properly configuring retries, and understanding what happened when a transaction failed midway through. This drove some refactoring to correct domain boundary issues so that we didn’t fall into anti-patterns like distributed transactions. We also invested heavily in our SRE team and learned how best to observe our services with the tools we had available.
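To give a sense of what that load testing looked like, here is a minimal Locust sketch. The user class, endpoints, and pacing below are illustrative assumptions rather than our actual test plan.

```python
# Minimal Locust sketch of a curbside-style load test.
# The endpoints and wait times are illustrative assumptions,
# not H-E-B's actual services or traffic model.
from locust import HttpUser, task, between


class CurbsideShopper(HttpUser):
    # Simulate a shopper working an order, with short pauses between actions.
    wait_time = between(1, 5)

    @task(3)
    def view_open_orders(self):
        # Hypothetical read-heavy endpoint.
        self.client.get("/orders?status=open")

    @task(1)
    def pick_item(self):
        # Hypothetical write path, exercised less frequently.
        self.client.post("/orders/123/items/456/pick", json={"quantity": 1})
```

A run along the lines of `locust -f locustfile.py --host https://<lower-env-host>` then ramps simulated users up toward the anticipated peak traffic.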
Given our operating environment and the critical nature of our workloads, we decided to add a service mesh. We needed visibility into the interactions between our workloads, and our SREs wanted standardized metrics and observability for every app. After consulting with the platform team, we moved the service mesh forward on our roadmap.
Evaluating service mesh offerings
In early 2020, we evaluated several service mesh offerings. We quickly determined that we would need something simple with minimal technical and operational overhead. Many of the offerings we looked at were highly complex and required a large effort to operate and understand. This is where Linkerd, the CNCF service mesh, stood out. Not only did we have Linkerd up and running in a test environment with our microservices in minutes, but we were also able to start evaluating meaningful capabilities quickly. Speed of evaluation and adoption was an important factor: the realities of the pandemic and customer demand drove an urgency to deliver.
In addition to the ease of getting started, we found Linkerd’s service discovery, mTLS encryption, and real-time tap of service health metrics incredibly powerful. We learned that shifting ownership of service-to-service networking and security to a control plane resolved many of the issues we had experienced. Writing, tuning, and securing service-to-service networking behavior ourselves would have introduced significant complexity in our codebases. Linkerd eliminated that networking overhead for us, requiring only the configuration in our service profiles.
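As a rough illustration of what those service profile configurations cover, here is a sketch of a Linkerd ServiceProfile with one retryable, time-limited route, applied via the Python Kubernetes client. The service name, namespace, route, and values are hypothetical; in practice these profiles live as YAML alongside the application code.

```python
# Sketch: applying a Linkerd ServiceProfile with a retryable route and timeout.
# The service name, namespace, route, and values are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

service_profile = {
    "apiVersion": "linkerd.io/v1alpha2",
    "kind": "ServiceProfile",
    "metadata": {
        # ServiceProfiles are named after the service's FQDN.
        "name": "orders.curbside.svc.cluster.local",
        "namespace": "curbside",
    },
    "spec": {
        "routes": [
            {
                "name": "GET /orders",
                "condition": {"method": "GET", "pathRegex": "/orders"},
                "isRetryable": True,   # safe to retry: idempotent read
                "timeout": "300ms",    # fail fast instead of queueing work
            }
        ],
        # Cap how much extra load retries are allowed to add.
        "retryBudget": {"retryRatio": 0.2, "minRetriesPerSecond": 10, "ttl": "10s"},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="linkerd.io",
    version="v1alpha2",
    namespace="curbside",
    plural="serviceprofiles",
    body=service_profile,
)
```

The retry budget is what keeps retries from amplifying an outage: the proxies will only add the configured fraction of extra load on top of the original requests.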
Based on this experience, and given the steep learning curve other service mesh offerings required, we made the decision to proceed with Linkerd. At this point, we set to work productionizing our new service mesh by moving control plane configurations into Terraform and configuring service profiles in our application codebases.
Learning and operating Linkerd
Initially, we added service profiles to our existing services by hand, but we later introduced the ability to generate a service profile template with default values when spinning up a brand-new codebase, using cookiecutter and a fork of the swagger maven plugin. This further reduced the cognitive burden engineers experienced when first working with Linkerd and configuring their services.
Our next step was to make sure the new curbside services and mesh were ready for primetime. To do this, we applied chaos engineering practices to everything we could in our ecosystem, in addition to our other test automation.
We had already chaos tested new services to ensure that our adopted resilience engineering patterns would hold up under failure, but with the mesh we took the approach of breaking different components of the control plane to better understand it and to build out documentation and runbooks where possible. While scaling suddenly to hundreds of pods, we broke several different components of the mesh and were surprised at how resilient the services and infrastructure were. We found that, once injected, the Linkerd proxies remained networked together even when major components of the control plane, such as the destination and identity services, were challenged. The biggest initial learning was how everything would behave if the Linkerd proxy-injector became unavailable: unless we need to suddenly scale up, our fulfillment services remain available, and for the cases where they wouldn’t, we built runbooks to recover from this scenario quickly. These chaos engineering activities boosted our confidence that our mesh and services were production ready.
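To make one of these experiments concrete, here is a sketch of the proxy-injector scenario using the Python Kubernetes client: delete the injector pods, then confirm that an already-meshed service keeps responding. The namespace, label selector, and health endpoint are assumptions for illustration, not our actual runbook.

```python
# Sketch of a control-plane chaos experiment: kill the Linkerd proxy-injector
# and verify that already-meshed workloads keep serving traffic.
# Namespace, label selector, and health URL are illustrative assumptions.
import time
import requests
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# 1. Delete the proxy-injector pods in the Linkerd control plane namespace.
injector_pods = core.list_namespaced_pod(
    namespace="linkerd",
    label_selector="linkerd.io/control-plane-component=proxy-injector",
)
for pod in injector_pods.items:
    core.delete_namespaced_pod(name=pod.metadata.name, namespace="linkerd")
    print(f"deleted {pod.metadata.name}")

# 2. Confirm an existing meshed service keeps responding while the injector is down.
for _ in range(30):
    resp = requests.get("https://orders.curbside.example.internal/healthz", timeout=2)
    assert resp.status_code == 200, "meshed service stopped responding"
    time.sleep(1)

print("existing workloads stayed healthy with the proxy-injector unavailable")
```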
Going live
In early July 2020, we cut over our first store from the monolith to our new microservices. Of course, we found a number of issues as the new system was put into the hands of new users, but these were functionality defects and were quickly resolved by the engineers. Our services and infrastructure were resilient and relatively easy to understand and operate.
Within days we were building the skills needed to support our new curbside services as we rolled out to more stores. As more features were added, we found that the team was deploying code faster, the system was more reliable, and our confidence in what we had built was high. As a result, we increased the number of stores on the services, and took more traffic and strain off of the monolith, which increased overall availability.
The urgency to roll out had not subsided, and because we were still experiencing instability with the on-premises monolith, the team found itself supporting two very different systems. Our partners at the stores wanted the new services, but many locations were gated on the remaining parity features they required, and therefore on our speed of delivery.
Becoming a high performance team
With our new capabilities and our ability to quickly understand issues and roll back with our new pipelines, our prior reservations about touching production started to diminish. In our legacy system we had been afraid to touch production; now we had confidence in what we were changing. We had finally reached a point of continuous delivery, deploying multiple times a day. This did push our change failure rate above 8-10%, as we rolled back more releases when we made mistakes or hit issues our automated tests didn’t catch. That metric drove us to invest in our change safety mechanisms.
As we looked at options to improve our change failure rate, we found that Linkerd opened the door to introducing canary analysis to our CI/CD pipeline. By utilizing the Flagger operator in our cluster with our GitLab runners and existing CI processes, we could progressively increase traffic to new deployments while monitoring the live metrics available in Linkerd. This drove our change failure rate to less than 5% while maintaining our on-demand deployments. Performance has remained at this level since the introduction of canary deployments.
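For readers who haven’t used Flagger, here is a rough sketch of the kind of canary definition that drives this, applied with the Python Kubernetes client for consistency with the earlier examples. The deployment name, port, and thresholds are hypothetical; in a real pipeline this would usually be a YAML manifest applied by CI.

```python
# Sketch: a Flagger Canary that shifts traffic to a new deployment in steps
# while watching Linkerd's success-rate and latency metrics.
# Names, namespaces, and thresholds are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()

canary = {
    "apiVersion": "flagger.app/v1beta1",
    "kind": "Canary",
    "metadata": {"name": "orders", "namespace": "curbside"},
    "spec": {
        "provider": "linkerd",
        "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "orders"},
        "service": {"port": 8080},
        "analysis": {
            "interval": "1m",      # how often to evaluate metrics
            "threshold": 5,        # failed checks before rollback
            "maxWeight": 50,       # max traffic share sent to the canary
            "stepWeight": 10,      # traffic increment per interval
            "metrics": [
                {"name": "request-success-rate", "thresholdRange": {"min": 99}, "interval": "1m"},
                {"name": "request-duration", "thresholdRange": {"max": 500}, "interval": "1m"},
            ],
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="flagger.app",
    version="v1beta1",
    namespace="curbside",
    plural="canaries",
    body=canary,
)
```

With a spec like this, Flagger shifts traffic to the new version in 10% steps up to 50%, and rolls back automatically if the success rate or latency reported through Linkerd breaches the thresholds for too many intervals.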
We also found many benefits in early issue detection. While our observability tools do a reasonable job of alerting us when an SLO is breached, the alerts tied to our Linkerd control plane health checks are the early warning that a critical issue has occurred. In many circumstances this has helped us resolve issues before our users see any impact. In this way, the Linkerd dashboard acts as a source of truth, letting us know whether the alerts triggered by our observability tools are false positives, require further tuning, or indicate something real and customer impacting.
With this foundation in place, we continued delivering, maintained measurably high performance as a team, and invested further in our development lifecycle. We built a dark canary environment that is 1:1 with our production environment. It mirrors production traffic to expose any data integrity or performance issues, and we use it to test infrastructure upgrades to further reduce risk to production. We upgrade Linkerd in lower environments and dark canary first, and so far we have found the Linkerd upgrade process easy and low overhead. We tend to stay one or two minor versions behind the latest Linkerd and Kubernetes releases, but we are starting to upgrade earlier in dark canary.
Continuously exceeding 99.99% uptime
By September of 2020 we had reached a majority of stores, and completed our rollout to all H-E-B Curbside stores early in 2021. The entire reinvention of our teams and systems took around 1.5 years to fully complete once engineering work began in August 2019.
Throughout the rollout, as traffic shifted, we saw a major decrease in issues in our legacy environment. For the team, this meant we could focus on continuously delivering business value rather than toiling on the monolith. The services we built and the technology we introduced accomplished our goals: our store partners had the tools and technical capabilities needed to serve our customers when it mattered most.
The business results and impact speak for themselves. In 2020, our legacy monolith dipped below 99.9% uptime several times; now our microservices exceed 99.99%. We also measure as a high-performing team against the four key DORA metrics (MTTR, lead time, change failure rate, and deployment frequency), and we are working iteratively toward elite performance by eliminating all manual judgement from our CI/CD processes. We have high confidence in our operational capabilities and are frequently delivering high-impact business value to our internal business customers.
At the end of the day, H-E-B didn’t choose Linkerd because it’s the coolest offering. We weren’t worried about which mesh had the most hype or the most marketing behind it. We needed a service mesh that we could easily understand so we could focus on building the best possible platform; to put it another way, we wanted a mesh that would just work and let us focus on our actual job. The ease of adoption, the reduction in operational complexity, and the capabilities it unlocked made Linkerd a major factor in reaching our goals quickly, at a time when H-E-B Curbside and Home Delivery services are critical to so many Texans.