Lyft’s Envoy dashboards
I’ve given quite a few talks about observability in the age of the service mesh (most recent slides; unfortunately this talk series has not been recorded yet). Visibility into the inherently unstable network is one of the most important things that Envoy provides, and I’m asked repeatedly for the source of the dashboards that we use at Lyft. In the interest of “shipping” and getting something out there that can help folks, we are releasing a snapshot of our internal Envoy dashboards.
What we are releasing is unfortunately not going to be readily consumable. It is also not an OSS project that will be maintained in any way. The goal is to provide a snapshot of what Lyft does internally (what is on each dashboard, what stats we look at, etc.). Our hope is that having it as a reference will be useful when developing new dashboards for your organization.
Lyft’s dashboarding in 60s
To provide some context for what is being shared, I will very briefly describe Lyft’s observability and dashboarding stack.
- All Envoys write stats in statsd format.
- We run statsrelay on each host.
- All of our stats are funneled to a pre-aggregation pipeline.
- The pre-aggregation pipeline ultimately writes stats out to Wavefront.
- Developers at Lyft look at dashboards in Grafana (we have a Wavefront plugin that pulls TSD).
- All dashboards at Lyft are created from SaltStack code (the Grafana SaltStack module is an approximation of what we use internally).
- We pre-generate dashboards for every service and also allow developers to add custom rows for business logic, etc.
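To make the first couple of steps above concrete: a statsd counter line is just `<name>:<value>|c` sent as a UDP datagram. The sketch below is illustrative only; the `production.infra.aws.ec2.asg.envoy` prefix is the one mentioned later in this post, while the stat name, host, and port are hypothetical, and exactly where in Lyft’s pipeline the prefix gets applied is an assumption.

```python
import socket

# Prefix Lyft applies to Envoy stats internally (per this post). Where in
# the pipeline it is applied is an assumption made for this sketch.
PREFIX = "production.infra.aws.ec2.asg.envoy"

def statsd_counter(name: str, value: int) -> bytes:
    """Format a statsd counter line ("<name>:<value>|c") with Lyft's prefix."""
    return f"{PREFIX}.{name}:{value}|c".encode()

def emit(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one statsd datagram to a local relay (e.g. statsrelay).
    Host and port are hypothetical defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

# "cluster.<name>.upstream_rq_total" is a standard Envoy cluster stat.
line = statsd_counter("cluster.service_a.upstream_rq_total", 1)
```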
Yann Ramin gave a great Monitorama presentation on Lyft’s observability stack if you would like more info (video/slides). We are also planning a more official Lyft engineering blog post on this topic, so keep an eye out for that.
What the SLS files include
The snapshot contains several SLS files that will be described below. At a high level, they include:
- Descriptions of dashboard components, including charts, queries, etc.
- Alarm setup (note that not all alarms that we use are described in the files we are providing).
- All of the queries use Wavefront syntax.
- All of our Envoy stats are internally prefixed with production.infra.aws.ec2.asg.envoy. When emitted to Wavefront, we tag stats with various things such as EC2 ASG, etc.
- With a little bit of effort, anyone generally familiar with stats and dashboards should find it straightforward to understand what everything means.
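As a concrete illustration of the Wavefront syntax used in the files, a request-rate chart might be driven by a query along these lines. Only the stat prefix and the general ts()/rate() functions of Wavefront’s query language are grounded in the post; the service name, stat, and tag are hypothetical:

```
rate(ts("production.infra.aws.ec2.asg.envoy.cluster.service_foo.upstream_rq_total", asg="service-foo-asg"))
```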
The Envoy dashboards
We utilize four primary Envoy dashboards at Lyft (please refer to my presentation slides for pictures):
Front/edge
These are our edge (“API gateway”) Envoys. They terminate TLS, perform auth and ratelimiting, and then route to backend services.
- Primary dashboard
- Utilizes macros
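As a rough sketch of what an edge dashboard tends to chart, the stat names below are standard Envoy HTTP connection manager stats; the `ingress_http` stat_prefix and the particular selection are my assumptions, not a dump of Lyft’s actual dashboard:

```python
# Standard Envoy HTTP connection manager stats an edge dashboard might
# chart. "ingress_http" is a hypothetical stat_prefix value.
EDGE_STATS = [
    "http.ingress_http.downstream_rq_total",  # overall request rate
    "http.ingress_http.downstream_rq_2xx",    # successful responses
    "http.ingress_http.downstream_rq_4xx",    # client errors (auth, ratelimit rejections, etc.)
    "http.ingress_http.downstream_rq_5xx",    # server errors
    "http.ingress_http.downstream_cx_total",  # downstream connections
    "http.ingress_http.downstream_rq_time",   # request latency histogram
]
```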
Global
This is a global view of all Envoys at Lyft. This gives a sense of overall network health across the entire infrastructure.
- Primary dashboard
- Utilizes macros
Service-to-service
This dashboard allows the user to select both the sending (egress) and receiving (ingress) service. Envoy stats are then populated for the specified network hop, allowing for a deep dive into the health of that hop.
- Primary dashboard
- Utilizes macros
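For a given hop, the egress side of the picture comes from the calling service’s Envoy, which records per-upstream-cluster stats for the destination (`cluster.<name>.upstream_rq_*` are standard Envoy cluster stats); selecting the sending service would rely on stat tags like the ASG tagging described above. A minimal sketch, with hypothetical service names:

```python
def upstream_hop_stats(destination: str) -> list[str]:
    """Stat names for the egress half of one service-to-service hop: the
    caller's Envoy emits these under the destination's cluster name."""
    cluster = f"cluster.{destination}"
    return [
        f"{cluster}.upstream_rq_total",   # request rate on the hop
        f"{cluster}.upstream_rq_5xx",     # errors as seen by the caller
        f"{cluster}.upstream_rq_time",    # latency as seen by the caller
        f"{cluster}.upstream_cx_active",  # active connections to the destination
    ]

# e.g. stats for calls into a hypothetical "users" service:
stats = upstream_hop_stats("users")
```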
Per-service
As I said above, we automatically generate a dashboard for every service at Lyft. The first two rows of this dashboard contain Envoy stats. (Since 100% of services run via Envoy, this is easy to set up.)
- The per-service dashboard is not included, but the Envoy rows are pulled from the same macros as the other dashboards.
Future work
Ultimately, I would love for there to be an OSS solution for Envoy observability. Please reach out to me if you are interested in helping with such a thing. In the meantime, I hope this is a useful reference for folks building Envoy systems and trying to figure out what stats to look at.