Evolving a Protocol Buffer canonical API
In the Envoy proxy, we adopted Protocol Buffers (protobuf), specifically proto3, as the canonical specification of our next generation APIs. These APIs describe how Envoy communicates with services on its control plane, for service discovery, logging, rate limiting, authorization and other similar roles. The contents of these APIs are discussed in significant depth in another post; below, we briefly recap recent history and share some of the lessons gathered as these APIs evolved.
v1 → v2: A foundation for Envoy API growth
Today, Envoy has two variants of its APIs, v1 and v2. Understanding how we arrived at the v2 APIs sets the stage for discussion of how the protobuf-based v2 APIs continue to evolve, and how a protobuf API can be adopted as a replacement for a legacy API in widespread use.
The legacy v1 API is a polling-based REST-JSON API (with some limited use of gRPC) that evolved as Envoy came into being, to meet the operational needs of its founding developers (Lyft) and early adopters. It is expressed through a combination of JSON schema definitions and handcrafted documentation.
When we set out to design the v2 API, intending it to be the long-term framework for expressing Envoy’s APIs with a view to future growth, we adopted a gRPC-first API. gRPC’s support for bidirectional streaming opened up new API and control plane possibilities, for example delegated health checking, advanced load balancing and stronger consistency models. By switching to gRPC, we made a clean break with v1 and did not attempt to provide backwards compatibility.
A lack of backwards compatibility might at first glance seem to be disruptive. Envoy’s v1 APIs were in production use at Lyft and also were widely adopted by the ecosystem surrounding Envoy, by systems such as Istio and Nelson.
We managed the breaking nature of the v2 transition by giving v1 a much longer deprecation window than is usual for other Envoy features, allowing for the gradual uptake of v2. While the v2 APIs have been production ready since the Envoy 1.5 release in December 2017, we have not yet begun formal deprecation of v1 (as of late January 2018) and expect it to be several quarters until this occurs.
To help encourage adoption of the v2 APIs, we have feature frozen the v1 API. This is a soft freeze: where there is a clearly articulated need for a feature to be added to both v1 and v2, we continue to support this.
To minimize the disruption of maintaining two independent APIs in Envoy, we switched to the v2 data model internally and wrote v1 → v2 translation libraries that are invoked during configuration ingestion. For example, a Listener resource object specified in v1 JSON would be converted to a v2 Listener protobuf, prior to sharing with the rest of the Envoy code base.
The remainder of this post focuses on the v2 API exclusively.
The Envoy v2 API canon
Envoy’s pivot to gRPC APIs implied that its API had to be specified in protobuf. This was independently useful when designing an API to scale to Envoy’s growing needs; in contrast to JSON, protobuf provides static typing for API specification and clear rules for extensibility.
Envoy’s gRPC APIs consist of:
- Service definitions, defining the gRPC streaming endpoints (e.g. https://xds.foo.bar/envoy.api.v2.ListenerDiscoveryService/StreamListeners) and the types of messages sent and received on each stream (see the sketch after this list).
- Message objects describing the resources exchanged on each stream, for example the Listener messages embedded in DiscoveryResponses.
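As a rough illustration, a trimmed-down sketch of what these definitions look like is below. The real Envoy definitions in the data-plane-api repository are considerably more elaborate; the field sets here are illustrative.

```proto
syntax = "proto3";

package envoy.api.v2;

// Trimmed request/response envelopes; the real definitions also carry node
// identification, resource names and nonce information.
message DiscoveryRequest {
  string version_info = 1;
}

message DiscoveryResponse {
  string version_info = 1;
}

// Sketch of a v2 discovery service definition.
service ListenerDiscoveryService {
  // Bidirectional stream: Envoy sends DiscoveryRequests and the management
  // server answers with DiscoveryResponses carrying Listener resources.
  rpc StreamListeners(stream DiscoveryRequest) returns (stream DiscoveryResponse) {
  }
}

// Sketch of the Listener resource exchanged on the stream.
message Listener {
  // Unique name of the listener.
  string name = 1;
}
```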
Beyond the gRPC APIs, which provide dynamic configuration and services such as rate limiting, we captured Envoy’s static configuration in the v2 API protobuf definitions. The static proxy configuration file, known as the bootstrap configuration, is ingested from the filesystem at Envoy startup. It provides a set of static resources and points Envoy at its control plane services. While we could have maintained Envoy’s static JSON configuration file format when transitioning to the v2 gRPC API, it was much cleaner to also switch the bootstrap file format to protobuf. This allowed reuse of objects between the dynamic APIs and bootstrap. For example, a Listener resource can either be fetched using the ListenerDiscoveryService above or provided in the bootstrap configuration file.
There was a recognized need to continue to support REST-JSON polling in the v2 API for Envoy API providers who did not wish to adopt gRPC. We could have continued to support the v1 REST-JSON API alongside the v2 gRPC API for this purpose, but this would have involved significant code duplication and violated the principle of DRY. Instead we opted to take advantage of proto3's well-defined JSON canonical mapping. Every proto3 message can be mapped to and from a JSON object, with an implementation of this mapping provided by the standard C++ protobuf library. In addition, there are HTTP annotations for service definition that can be used to specify REST endpoints, for example:
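For instance, a REST polling endpoint can be layered onto the same service as a unary method carrying an HTTP annotation, roughly as below (a sketch reusing the DiscoveryRequest/DiscoveryResponse messages from the earlier sketch; consult the Envoy data-plane-api repository for the authoritative definition):

```proto
import "google/api/annotations.proto";

service ListenerDiscoveryService {
  // REST-JSON polling endpoint generated from the same protobuf definition
  // as the gRPC API. The annotation pins the method to a fixed URL path.
  rpc FetchListeners(DiscoveryRequest) returns (DiscoveryResponse) {
    option (google.api.http) = {
      post: "/v2/discovery:listeners"
      body: "*"
    };
  }
}
```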
We utilized HTTP annotations and the proto3 JSON canonical mapping to allow the standard protobuf definitions for the gRPC API to be used to also specify the v2 REST-JSON API. For static representation of proto3 objects, we also supported YAML, via a YAML → JSON → proto3 translation chain.
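To make the translation chain concrete, here is a hypothetical round trip, written as comments against trimmed messages (the field sets are illustrative):

```proto
// A YAML fragment such as:
//
//   name: http_listener
//   address:
//     socket_address: { address: 127.0.0.1, port_value: 80 }
//
// is first converted to the canonical JSON
//
//   {"name": "http_listener",
//    "address": {"socketAddress": {"address": "127.0.0.1", "portValue": 80}}}
//
// (note the proto3 JSON mapping's camelCase keys), and then parsed into the
// proto3 messages below via the standard JSON mapping.
message SocketAddress {
  string address = 1;
  uint32 port_value = 2;
}

message Address {
  SocketAddress socket_address = 1;
}

message Listener {
  string name = 1;
  Address address = 2;
}
```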
There are two non-functional aspects of the v2 API that deserve mention:
- Protobuf annotations. While protobuf static typing is a clear win over regular JSON schema for both specifying and crafting API resources, the protobuf type system is strictly (and much) weaker than that provided by the dynamic typing in JSON schema. For example, protobuf does not allow one to specify that a uint32 must be in the interval [3,56), while this is possible in JSON schema. The same applies when stating that a string field should be non-empty, a Duration non-zero or a oneof value provided. We combined both approaches by employing field annotations to specify these stronger constraints and generating C++ stubs with the protoc-gen-validate (PGV) protobuf compiler plugin to check these constraints at runtime. These annotations live in the Envoy v2 APIs; see the first sketch after this list.
- Documentation. We inlined the API documentation with the protobuf definitions in the v2 API, providing another DRY win. In the v1 API, each resource object was specified twice: once in JSON schema and again in separate handcrafted documentation in a pseudo-JSON description. For v2, we wrote a tool to generate documentation directly from the protobuf definitions. This tool took the form of a Python protobuf compiler plugin called protodoc. We wrote a custom tool since we wanted the output format to match the Sphinx reStructuredText (RST) used in the rest of the Envoy documentation (and the v1 APIs). Sphinx RST features such as internal link validation are very useful at scale. An example of this inlined Sphinx RST appears after the PGV sketch below.
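For instance, constraints of the kind described in the first bullet can be expressed with PGV annotations roughly as follows (the message and field names are illustrative, not actual Envoy definitions):

```proto
import "google/protobuf/duration.proto";
import "validate/validate.proto";

message ExampleConfig {
  // Must lie in the half-open interval [3, 56).
  uint32 retries = 1 [(validate.rules).uint32 = {gte: 3, lt: 56}];

  // Must be a non-empty string.
  string cluster = 2 [(validate.rules).string.min_bytes = 1];

  // Must be present.
  google.protobuf.Duration timeout = 3
      [(validate.rules).duration.required = true];
}
```

PGV then generates per-message validation stubs that are invoked while ingesting configuration, rejecting resources that violate the annotated constraints.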
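The documentation inlining looks roughly like this: protobuf comments are written directly in Sphinx RST, and protodoc renders them into the published docs (the comment text and :ref: anchors below are illustrative):

```proto
// Listener :ref:`configuration overview <config_listeners>`.
message Listener {
  // The unique name by which this listener is known. If no name is provided,
  // Envoy will allocate an internal UUID for the listener. See
  // :ref:`LDS <arch_overview_dynamic_config_lds>` for dynamically fetched
  // listeners.
  string name = 1;
}
```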
Protocol buffer wire compatibility rules
It is very rare that an API remains static and does not continue to evolve. Our initial thinking for the Envoy v2 APIs was that we would play by the rules of protobuf wire compatibility, and this would be enough to avoid breaking API providers and consumers.
Protobuf wire compatibility entails ensuring that a protobuf produced for an earlier version of the API is consumable by an implementation built using a later version of the API. The rules here are well understood and include the following (a sketch applying them appears after the list):
- Existing fields should not be renumbered.
- Existing fields should not have their types changed. There are some exceptions: (1) a regular field can be upgraded to a oneof; (2) a regular field can be upgraded to a repeated field.
- New fields should not reuse any previously assigned field number.
- Enum defaults should be picked such that they make sense looking forward, or be set to UNSPECIFIED.
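Applied to a concrete message, the rules look like this sketch (the message, fields and enum values are illustrative):

```proto
import "google/protobuf/duration.proto";

message Cluster {
  string name = 1;

  // Field 2 was removed in an earlier revision; reserving the number and the
  // old name guarantees neither is ever reused with a different meaning.
  reserved 2;
  reserved "hosts_json";

  // New fields take fresh, previously unassigned numbers.
  google.protobuf.Duration connect_timeout = 3;

  // The zero value is an explicit UNSPECIFIED sentinel, so producers that
  // omit the field do not silently select a meaningful default.
  enum LbPolicy {
    LB_POLICY_UNSPECIFIED = 0;
    ROUND_ROBIN = 1;
    LEAST_REQUEST = 2;
  }
  LbPolicy lb_policy = 4;
}
```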
With these in mind, the v2 API could be considered a living API, with the ability to grow new fields, message types and gRPC or REST service endpoints.
A deprecation story was introduced, whereby fields to be removed would first be tagged as deprecated, then later removed from the API once the Envoy proxy implementation dropped support for them, in accordance with Envoy’s breaking change policy.
Under this model, there is no explicit versioning. We used the package namespace envoy.api.v2 to provide a single coarse API version. We anticipate that if there is a v3, it will be a long time coming and be yet another clean break from the previous API version.
The above rules on protobuf wire compatibility do not capture a corner case that we recently encountered when reorganizing the API package namespaces in the v2 API. Usually it’s considered “safe” to change package namespaces, and freely move protobuf definitions between .proto files, since these are not part of the protobuf wire format. There is an exception to this when embedding opaque messages using the google.protobuf.Any type, since package namespaces manifest in type URLs, which are included in the message. There is a more complete post elsewhere on how Envoy makes use of the Any type when specifying embedded objects. Envoy relies on the embedded type URL to identify the message type when unpacking an Any, hence any change to the package namespace of a message that appears directly in an Any field is a breaking change at the protobuf wire level for Envoy.
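A sketch of why this bites (messages trimmed): the type URL, which embeds the package namespace, is serialized inside the Any.

```proto
import "google/protobuf/any.proto";

message DiscoveryResponse {
  string version_info = 1;

  // Each resource is packed as an Any carrying a type URL such as
  // "type.googleapis.com/envoy.api.v2.Listener" on the wire. Envoy uses the
  // URL to select the concrete message type when unpacking, so moving
  // Listener to a new package changes the bytes on the wire and breaks
  // unpacking for peers built against the old namespace.
  repeated google.protobuf.Any resources = 2;
}
```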
JSON/YAML wire compatibility rules
While the rules for protobuf wire compatibility are well understood, both within Google and in the wider OSS world, there’s an additional caveat that applies when using the JSON canonical transform from proto3 to JSON (and related YAML).
The additional consideration stems from the fact that the JSON wire format is far more verbose than the protobuf wire format. The protobuf wire format elides field names in message objects, since they are implied by the protobuf type known by both parties. Instead, a more efficient integer encoding of the field number is used. JSON includes the full field name in a key-value pair. This implies that if protobuf wire compatibility is your only consideration, it’s perfectly safe to rename fields at will, e.g. s/cluster_name/cluster_names/. However, field renaming is a JSON wire breaking change (and also a text proto breaking change). During recent changes, we were burned by “field renames considered harmless” thinking and broke JSON wire compatibility. We have clarified our API style rules to prevent this going forward.
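A sketch of the trap (message and field names illustrative): the rename below is invisible on the protobuf wire but breaks every JSON/YAML document spelling out the old key.

```proto
message RouteAction {
  // Was: string cluster_name = 1;
  // The field number is unchanged, so binary protobuf peers are unaffected.
  // The canonical JSON rendering, however, changes from
  //   {"clusterName": "foo"}
  // to
  //   {"clusterNames": "foo"}
  // breaking existing JSON/YAML configuration (and text protos).
  string cluster_names = 1;
}
```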
Beyond wire compatibility
Wire compatibility is a necessary but not sufficient condition for not breaking API consumers as the API evolves.
- Servers on the control plane manifest their gRPC and REST endpoints at URL paths defined by the Envoy APIs. Changes to package namespaces affect gRPC endpoint paths, since these are derived from the package name, e.g. /envoy.api.v2.ListenerDiscoveryService/StreamListeners. This is another reason to tread very carefully during package namespace changes when the packages contain service definitions. An interesting observation is that REST endpoints defined via HTTP annotations, as described above, do not suffer from this breakage, since their absolute URL path is defined by the supplied annotation string rather than inferred from the package namespace, e.g. /v2/discovery:listeners. See the sketch after this list.
- Field renames, movement of messages between .proto files, .proto file renames and package namespace changes will break code consuming the generated protobuf stubs. This is less serious than the wire and service endpoint compatibility concerns, since it does not lead to client/server compatibility problems, but it does result in code churn and an engineering cost borne by projects consuming the APIs. Ideally this happens with low frequency and amplitude.
- PGV annotations are not currently subject to any compatibility rules; they simply need to remain in lock step with Envoy’s implementation. This may need to be made more restrictive if they become widely used by code bases beyond Envoy.
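The gRPC path derivation is mechanical, as the sketch below shows: the fully qualified service name becomes the URL prefix (DiscoveryRequest/DiscoveryResponse as in the earlier sketches).

```proto
package envoy.api.v2;

service ListenerDiscoveryService {
  // gRPC method path:
  //   /envoy.api.v2.ListenerDiscoveryService/StreamListeners
  // Renaming the package to, say, envoy.service.v2 would silently move the
  // endpoint to
  //   /envoy.service.v2.ListenerDiscoveryService/StreamListeners
  // breaking any client built against the old path.
  rpc StreamListeners(stream DiscoveryRequest) returns (stream DiscoveryResponse) {
  }
}
```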
We have broken all of the above compatibility considerations in a series of recent changes. Some of these breakages we rolled back; others were considered less dangerous and allowed to persist. We have since codified these lessons in our API style guide to prevent future violations.
Lessons learned: don’t get burnt
The Envoy v2 API has been a significant success, opening up Envoy to a wide range of new features and use cases. We are very happy with the adoption of protobuf as the canonical definition for the dynamic APIs (gRPC, REST-JSON), static configuration (protobuf, JSON, YAML) and API documentation, all of which have been production ready since Envoy 1.5.
Our hard-won TL;DR lessons on the journey to a production ready, stable, backwards compatible protobuf canonical API are:
- Protobuf wire compatibility is not the full story for API backwards compatibility, even when the API is defined purely in terms of protobuf. Service endpoint stability, JSON/YAML wire compatibility and code churn in projects consuming the API are also significant considerations.
- Design the package namespace hierarchy carefully upfront. We initially placed many of our core message types in envoy.api.v2 with inadequate namespacing, since the APIs were small at the time. From little things, big things grow, and the package namespace churn described above arose from our wish to split the core API objects, configuration and service definitions across new package namespaces.
- Do not change field names if JSON/YAML wire compatibility is desired.
- Do not change a message’s package namespace if it might appear directly embedded in an Any field.
- Adopt a protobuf style guide, such as https://github.com/envoyproxy/data-plane-api/blob/master/STYLE.md, early in the API design process to maximize API consistency. This reduces the need for API churn.
- Learn where possible from the Google API design guide best practices.
- Adopt a breaking change policy for both implementation and API.
- Looking forward, one actionable item is to add a set of cross-version API integration tests. These would likely have caught the issues described above.
Acknowledgements: The Envoy v2 API development has been a collaborative effort with major contributions and review from mattklein123, Anna Berenberg, Louis Ryan, Piotr Sikora, Kuat Yessenov, Hong Zhang, Purvi Desai, Rohit Bhoj and a large number of other significant contributors in the Envoy community. The above lessons in evolving an API stem from joint work with the team on the v2 APIs. Thanks also to Trevor Schroeder for helpful feedback on earlier drafts of this article.
Disclaimer: The opinions stated here are my own, not those of my company (Google).