Embracing eventual consistency in SoA networking
One of the fundamental design tenets of Envoy is eventual consistency, permeating nearly every aspect of the system from the threading model to service and configuration discovery (see also here and here for more info).
As powerful as the eventually consistent paradigm is when applied to SoA networking, I’ve noticed over the last several years that many find it counterintuitive. This is not surprising: eventual consistency makes sense in theory, but quickly becomes confusing when applied to many real-world problems.
Modern database design is a great example of attempting to balance the performance gains of eventual consistency versus the programmer ease of strong consistency. The industry has seen usage trends move from consistent transactional RDBMS such as MySQL, to eventually consistent NoSQL systems such as DynamoDB, back to massively scalable consistent transactional RDBMS such as Spanner. This post from Google on why programmers should always pick strong consistency whenever possible is a great read on this topic. (As an aside I think Cloud Spanner is an absolutely game changing technology but that’s not the subject of this article.)
In this post I’m going to talk about why embracing eventual consistency in SoA networking is so important. I’m also going to explain some of the ways in which distributed system programmers will need to rethink how they evaluate and use eventually consistent service discovery and load balancing systems in order to extract full value.
Why eventual consistency?
Fundamentally, distributed systems are eventually consistent at their core:
- Compute nodes can die or temporarily lose connectivity at any time.
- Network paths can die or be temporarily broken, leading to routing changes and potentially split brain situations.
- Software deploys do not happen at the same time, whether due to rolling deploys, canary, cron jittering, or many other factors.
- Autoscaling changes the system topology frequently by design. Nodes do not shut down at the same time nor do they boot at the same time. As the industry moves increasingly towards FaaS, this fact will only become more pronounced.
See also this great post from Werner Vogels on eventual consistency for a deeper discussion of the underlying computer science.
Of course it’s possible to layer strong consistency on top of an eventually consistent base; Spanner is an engineering marvel that does just that. However, leader election, quorum, consensus algorithms, etc. all have costs in terms of engineering and operational complexity and more importantly latency increase. Strong consistency guarantees are not free. Reads and writes to Spanner might take an order of magnitude longer than if applied to an eventually consistent NoSQL system.
Stepping back to Envoy and basic SoA networking and configuration discovery, is it possible for every node among potentially tens of thousands, hundreds of thousands, or millions of nodes to have a consistent view of the overall system? I would argue not. Yet, historically, many distributed networking systems have relied on strongly consistent backing stores (e.g., ZooKeeper) to hold membership and software version data. Scaling these stores to large data sizes is difficult and frequently leads to outages. So why bother?
When originally designing Envoy my goal was to embrace eventual consistency for core distributed networking and take it to its logical extreme:
- If membership data is not consistent, why can’t we cache it for tens of seconds in memory in each discovery service node?
- If membership data is cached for tens of seconds in each discovery service node, why can’t each Envoy cache it for tens of seconds?
- If each Envoy is caching data for tens of seconds, why does there need to be consistency within Envoy among each worker? (See my post on Envoy’s threading model for more information on this.)
The answer is that when it comes to membership and configuration for distributed networking there doesn’t need to be consistency anywhere because there can’t be consistency anywhere. Embracing eventual consistency everywhere yields a substantially simpler system that is higher in performance and easier to operate.
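The chain of questions above can be sketched as a stack of TTL caches, where every layer tolerates bounded staleness rather than demanding a consistent view. This is an illustrative model only (the class and names are invented, not Envoy's actual internals):

```python
import time

class TtlCache:
    """Cache a value for a bounded window; staleness within the TTL is acceptable."""
    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch        # upstream source (may itself be another cache)
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at > self._ttl:
            self._value = self._fetch()  # refresh from upstream on expiry
            self._fetched_at = now
        return self._value

# Stacked caches: the discovery service caches the backing store, and each
# Envoy (or each worker thread) caches the discovery service. No layer needs
# a consistent view; each is at most a few TTLs behind the source of truth.
backing_store = lambda: {"cluster_a": ["10.0.0.1", "10.0.0.2"]}
discovery_cache = TtlCache(backing_store)
worker_cache = TtlCache(discovery_cache.get)
hosts = worker_cache.get()["cluster_a"]
```

The key design point is that staleness compounds additively across layers (a few tens of seconds per layer), which is perfectly acceptable for membership data that is already stale the moment it is read.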
Consistency and the Envoy xDS APIs
I won’t rehash the Envoy APIs in this post. For a full treatment please see my discussion of the universal data plane API and Envoy’s documentation on dynamic configuration.
I will mention that for programming simplicity and performance reasons, all of the Envoy APIs are eventually consistent and distinct from each other even though they may logically relate. For example, a route served by the Route Discovery Service (RDS) API can refer to a cluster served by the Cluster Discovery Service (CDS) API. However, RDS does not explicitly depend on CDS. This means that if Envoy is connecting to different management servers for the RDS API and the CDS API, it’s easily possible that Envoy will be sent a route update for a cluster that has not yet been defined by the management server providing the CDS API.
The previous paragraph is precisely what many new users find confusing about the Envoy configuration and discovery model. They ask: if the APIs relate, and yet are eventually consistent and disjoint, won’t that lead to situations in which Envoy will do the wrong thing? Technically, it is true that eventually consistent API ingestion may lead to situations such as the one previously described in which a route refers to a cluster that does not yet exist. However, does this situation actually happen in the real world?
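As a minimal illustration of the disjointness (plain Python with invented names, not Envoy code): routes and clusters live in separate tables because the APIs are ingested independently, so a route update can name a cluster that has not arrived yet.

```python
# Routes (from RDS) and clusters (from CDS) are ingested independently.
routes = {}    # path -> cluster name
clusters = {}  # cluster name -> endpoints

def handle_request(path):
    cluster = routes.get(path)
    if cluster is None:
        return 404  # no route matched
    if cluster not in clusters:
        return 503  # route names a cluster that CDS has not delivered yet
    return 200      # forward to an endpoint of `cluster`

# An RDS update lands before the matching CDS update:
routes["/"] = "cluster_b"
assert handle_request("/") == 503  # transiently unroutable
clusters["cluster_b"] = ["10.0.0.3"]
assert handle_request("/") == 200  # converged
```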
How are distributed load balancing and routing configuration applied in the real world?
One of the things I have been most fascinated by recently is observing how people evaluate systems such as Envoy and Istio. It’s human nature to start with simple tests to “kick the tires.” An initial user is likely to do two things immediately:
- Perform some type of “horse race” benchmark comparing Envoy to HAProxy, NGINX, or some other load balancer (i.e., seeing if 25K+ RPS of synthetic HTTP traffic can be pushed per CPU thread).
- Use Istio with Kubernetes or write a simple xDS server for Envoy and perform a small test using RDS and CDS to immediately shift traffic from cluster A to a brand new cluster B.
I’m not going to spend a lot of time discussing (1). I understand why new users do this, though it frustrates me given how unrealistic these benchmarks are for determining real-world efficacy (when was the last time you ran 25K RPS per thread of trivial requests and responses on a production node in a real system?). Performance is very important, but real-world performance is extremely complex and depends on both the features in use and the performance of the system at P99 with real workloads. These facts mostly determine how much capacity margin is required to absorb usage bursts.
(2) above is more interesting.
Figure 1 shows a graphical depiction of scenario (2) described above. In this scenario, RDS provides route information while CDS provides static cluster information (a real implementation would likely also use the Endpoint Discovery Service API but there is no need to make this example more complicated). Assume the following sequence of events:
- CDS provides the static endpoint definitions of cluster A.
- RDS provides a route mapping / to cluster A.
- The operator makes a change, immediately having CDS provide cluster B while removing cluster A, and having RDS provide a route mapping / to cluster B.
What will happen in this scenario when using Envoy’s eventually consistent model? It is very likely that there will be a period of time in which Envoy serves either 404s or 503s (depending on configuration). This is because each component in the system converges independently: in a nondeterministic order, Envoy will eventually remove cluster A and replace it with cluster B, and will eventually repoint the route mapping / from cluster A to cluster B. During the intermediate period of convergence, the mappings may be invalid, which causes the error responses.
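The convergence behavior can be simulated directly. In this sketch (an illustrative toy model, not Envoy code), the operator's change arrives as two independent updates, and whichever order they land in there is an intermediate window of errors:

```python
import itertools

def serve(routes, clusters, path="/"):
    cluster = routes.get(path)
    if cluster is None:
        return 404
    return 200 if cluster in clusters else 503

def operator_change():
    # The "immediate swap" arrives as two independent, unordered updates:
    def cds(state):  # CDS: cluster A removed, cluster B added
        state["clusters"] = {"cluster_b": ["10.0.0.3"]}
    def rds(state):  # RDS: "/" now points at cluster B
        state["routes"] = {"/": "cluster_b"}
    return [cds, rds]

statuses_seen = set()
for order in itertools.permutations(operator_change()):
    state = {"routes": {"/": "cluster_a"},
             "clusters": {"cluster_a": ["10.0.0.1"]}}
    for update in order:
        update(state)
        statuses_seen.add(serve(state["routes"], state["clusters"]))

# Whichever update lands first, there is an intermediate window of 503s,
# and the final converged state always serves 200s.
assert statuses_seen == {503, 200}
```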
Isn’t this a bug? Most users who initially experience this view it as such. And in fact, because this initial reaction from users is so common, most products treat it as a bug as well, even though the above scenario does not occur during real-world operation of any well designed distributed system.
In the real world, roughly the following sequence of events occurs:
- Cluster B is created either manually or via an automated blue/green deploy process.
- Manual or automated testing makes sure that cluster B is up and ready to receive traffic.
- Traffic is slowly shifted from cluster A to cluster B initially via canary and finally by percentage rollout (thus cluster A, cluster B, and routes that point to them both exist at the same time).
- Cluster A and associated routes are finally removed when no more traffic is going to cluster A.
Notice that in the previous sequence of events, Envoy’s eventually consistent model of splitting APIs yields no adverse effects. The user of the system observes the behavior they ultimately expect even though the underlying implementation converges in a nondeterministic way.
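The realistic sequence can be sketched in a few lines (again an illustrative toy model, not Envoy code): because cluster B exists before any route points at it, and cluster A is removed only after nothing routes to it, every intermediate state is routable.

```python
def serve(routes, clusters, path="/"):
    cluster = routes.get(path)
    if cluster is None:
        return 404
    return 200 if cluster in clusters else 503

routes = {"/": "cluster_a"}
clusters = {"cluster_a": ["10.0.0.1"]}

# 1. Cluster B is created alongside cluster A (make-before-break).
clusters["cluster_b"] = ["10.0.0.3"]
assert serve(routes, clusters) == 200

# 2. Traffic is shifted; both clusters exist for the duration of the shift.
routes["/"] = "cluster_b"
assert serve(routes, clusters) == 200

# 3. Cluster A is removed only once no traffic is going to it.
del clusters["cluster_a"]
assert serve(routes, clusters) == 200  # no intermediate error window
```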
The Envoy Aggregated Discovery Service (ADS) API
Because of the initial synthetic routing test that the great majority of users perform when they first use Envoy and control planes such as Istio, and the fact that most users think that what they are seeing is broken, products built on top of Envoy typically must provide more consistency than the basic Envoy APIs naturally provide. This is a fantastic example of product design tradeoffs. Sometimes products must supply behavior that will not be used in the real world purely because initial user reaction when performing unrealistic “kick the tires” tests is so strongly negative.
In response to this, we developed the Aggregated Discovery Service (ADS) API. This API, shown in figure 2, allows all of the other Envoy APIs to be multiplexed over a single gRPC stream, thus ensuring that a single management server handles all of them. This allows the management server to sequence discovery updates in such a way that Envoy never drops any requests. In the example above, for instance, the management server can enforce that cluster B is added, / is pointed to cluster B, and then cluster A is removed, in that specific order.
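The sequencing guarantee can be modeled as follows (an illustrative sketch of the idea, not the actual gRPC machinery): because every update flows through one stream, the management server chooses an interleaving under which every intermediate state is routable.

```python
from queue import Queue

def serve(routes, clusters, path="/"):
    cluster = routes.get(path)
    if cluster is None:
        return 404
    return 200 if cluster in clusters else 503

# One multiplexed stream (a stand-in for the single ADS gRPC stream). The
# management server alone controls the order in which updates are seen.
stream = Queue()
stream.put(("CDS", {"cluster_a": ["10.0.0.1"], "cluster_b": ["10.0.0.3"]}))  # 1. add B, keep A
stream.put(("RDS", {"/": "cluster_b"}))                                      # 2. repoint "/"
stream.put(("CDS", {"cluster_b": ["10.0.0.3"]}))                             # 3. remove A last

routes, clusters = {"/": "cluster_a"}, {"cluster_a": ["10.0.0.1"]}
while not stream.empty():
    kind, update = stream.get()
    if kind == "CDS":
        clusters = update
    else:
        routes = update
    assert serve(routes, clusters) == 200  # no step ever drops a request
```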
However, there is no free lunch in computing. Developing an ADS based management server is substantially more complicated than a fully eventually consistent implementation using discrete RDS, CDS, EDS, etc. APIs. And to reiterate, the irony is that all of the complexity required for an ADS implementation is mainly to satisfy the initial reaction of users who perform a set of synthetic tests that will never happen in the real world.
Conclusion
Distributed system design is often a complex tradeoff between eventually and strongly consistent components. Eventual consistency typically yields better performance, easier operation, and better scalability while requiring programmers to understand a more complicated data model.
System design trends have also changed over time. As I described in the introduction, databases are now trending back to strong consistency after years in which the industry preferred to build applications on top of eventually consistent stores, only to realize how complex that is.
On the other hand, I believe that SoA networking has always been eventually consistent, and to pretend otherwise is to fool ourselves. Instead, we should embrace the eventually consistent model for distributed networking. Part of embracing eventual consistency is evaluating systems in terms of how they will be used in production rather than how quickly we can run a few sanity tests. I’m a realist; I don’t think this change will happen overnight, and the fact that ADS exists and will be implemented by some management servers and products purely to satisfy what I consider to be an unrealistic use case means that we have a long way to go.