Securing the Service Mesh with SPIRE 0.3
(Cross posted @ Scytale.io)
tl;dr: this post details one of the highlights of the SPIRE 0.3 release: enabling operations engineering teams to use SPIRE, an open-source software (OSS) reference implementation of the burgeoning SPIFFE specifications, to deploy a secure service mesh built on the Lyft Envoy service proxy.
As an organization incrementally shifts development from “monolithic” applications to distributed microservices, new challenges arise. One we hear consistently: how should microservices (“workloads” from here on in) discover, authenticate, and securely connect to each other across potentially untrusted networks?
Addressing this per workload, while possible, is cumbersome for a software developer, and increasingly difficult to manage with more workloads. Instead, organizations small and large are beginning to use proxies (like Envoy from Lyft and Linkerd from Buoyant) to handle discovery, authentication, and encryption on a workload’s behalf. When workloads connect this way, the resulting design pattern is popularly described as a service mesh.
Let’s say workloads A and B want to communicate securely. In this world, their corresponding proxies must first know their workloads’ identities. If A wants to establish a mutual TLS (mTLS) connection with B, say, then A must prove its identity to B by demonstrating possession of a private key during the handshake, while B must verify A’s identity using credentials, such as an X.509 certificate, associated with that key.
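In Python terms, the TLS setup each proxy needs can be sketched as follows. This is a conceptual illustration using the standard library’s ssl module, not Envoy’s actual configuration; the file-path parameters are placeholders:

```python
import ssl

def make_mtls_context(cert_file=None, key_file=None, ca_bundle=None):
    """Build a TLS context suitable for mutual authentication.
    File arguments are illustrative placeholders; omit them to
    inspect the context without real credentials on disk."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Identity here is checked via certificates, not DNS hostnames.
    ctx.check_hostname = False
    # Require the peer to present (and prove possession of) a certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # Present our own identity: certificate chain plus private key.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_bundle:
        # Trust only peers whose certificates chain to this bundle.
        ctx.load_verify_locations(cafile=ca_bundle)
    return ctx
```

The key point is `CERT_REQUIRED`: both sides must present credentials, which is what makes the connection *mutually* authenticated.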
Doing this work manually suffices for simple scenarios, but it doesn’t always scale. For instance, it’s cumbersome when a workload’s underlying infrastructure elastically scales (if the infrastructure is part of an AWS auto-scaling group, for example) or is dynamically placed (if deployed on a Kubernetes cluster, for example). This is exacerbated because software developers must agree on 1) an identity format; and 2) how said identity is encoded within credentials.
What if these credentials are stolen? In such a scenario, an attacker could assume the workload’s identity and send/receive messages on its behalf. As such, it’s important these credentials:
- are not stored separately from the workload (where they might be compromised).
- are rotated frequently, so that if they are compromised, they are of limited use to an attacker.
I’m working on SPIRE, which delivers infrastructure that addresses these concerns. At its heart is a toolchain that automatically issues and rotates authorized credentials. Operations engineers must first describe workload(s) in terms of an attestation policy. Here’s an example policy paraphrased in English:
“to be granted the identity spiffe://acme.com/Blog, a workload must prove it is running on an Amazon EC2 instance within security group sg-a33873d1, and is running in a Kubernetes pod labelled service-blog ”
SPIRE checks each workload against these policies and issues matching SPIFFE identities. It allows teams to describe and verify policies for workloads running on varying infrastructure types, including bare metal, public cloud (like AWS), and container platforms (like Kubernetes).
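Conceptually, the Server’s decision is a subset check: a registered identity is issued only when all of its required attestation selectors are satisfied by what attestation discovered about the workload. A simplified sketch of that idea (not SPIRE’s actual code; the selector strings are illustrative):

```python
def match_identity(registrations, discovered_selectors):
    """Return the SPIFFE IDs whose required selectors are all
    satisfied by the selectors discovered during attestation."""
    discovered = set(discovered_selectors)
    return [spiffe_id for spiffe_id, required in registrations
            if set(required) <= discovered]

# One registration, mirroring the English-language policy above.
registrations = [
    ("spiffe://acme.com/Blog",
     ["aws:sg:sg-a33873d1", "k8s:label:service-blog"]),
]

# What attestation observed about a candidate workload.
observed = ["aws:sg:sg-a33873d1", "k8s:label:service-blog", "unix:uid:1000"]

print(match_identity(registrations, observed))  # ['spiffe://acme.com/Blog']
```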
An example
In this example, my service mesh consists of Envoy and SPIRE. I connect two workloads — a flaskBB blog and a MariaDB database — running on two Amazon EC2 instances. Envoy establishes, authenticates, and encrypts the connection, and SPIRE automatically generates per-instance credentials.

Let’s go deeper:
The SPIRE Server maintains the canonical registry of workload identities and attestation policies. In my example, the database’s identity is spiffe://example.org/Database, while the blog’s identity is spiffe://example.org/Blog. The Server also exposes the Registration API that can be used to add, remove, and edit a policy. In this example, the Server runs on an Amazon EC2 instance separate from the workloads.
The SPIRE Agent runs as a daemon on the same Amazon EC2 instance as a workload. It locally exposes the SPIFFE Workload API, which a process on the same instance calls to request its identity documents (an X509-SVID certificate and its corresponding private key) along with the bundle of certificates needed to verify other workloads.
Since neither the blog, the database, nor the Envoy proxy natively supports the Workload API, in this example we’ll use the SPIFFE Sidecar to help us. This runs as a daemon process on the same Amazon EC2 instance as the workload. It retrieves credentials from the Agent’s Workload API, stores them on disk, and signals to Envoy to restart when new credentials are available. SPIRE-issued certificates have a one-hour expiry by default. The Sidecar tracks said expiry and automatically calls the Workload API for fresh ones.
Since the Sidecar process is what’s calling the Workload API, it is considered a workload for attestation purposes. The Sidecar does this on behalf of Envoy, which, in turn, acts on behalf of the blog and database workloads. I can do this reasonably safely since the actual workload, Envoy, and Sidecar share the same isolation boundary. For tighter coupling, I could have Envoy or the workload communicate directly with the Workload API — something I’ll explore in a future post.
Envoy establishes a mutually-authenticated TLS connection between proxied workloads. Via the Sidecar, Envoy retrieves 1) the requisite private key to establish an mTLS connection between workloads; and 2) the X509-SVID certificates to verify ingress connections. SPIRE ensures workloads receive authorized private keys. In our example, the /Database proxy is configured to only accept connections from the /Blog proxy.
Finally, I have the workloads themselves. In the example, my blog and database communicate via Envoy. Rather than have the blog track the database’s IP address, authenticate the connection, and then encrypt it, the blog instead delegates to Envoy these tasks. The blog opens a MySQL connection to a pre-configured port on localhost, and Envoy routes it to the database.
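From the blog’s point of view, the delegation amounts to nothing more than pointing its database connection URL at the local Envoy listener (port 8003, as configured below). The user, password, and database name here are illustrative placeholders:

```python
from urllib.parse import urlsplit

# The blog never dials the database's real address; it talks to the
# local Envoy listener, which proxies and mTLS-protects the hop.
DB_URL = "mysql+pymysql://blog_user:blog_pass@127.0.0.1:8003/flaskbb"

# Sanity-check where the connection will actually go: localhost, Envoy's port.
parts = urlsplit(DB_URL)
print(parts.hostname, parts.port)  # 127.0.0.1 8003
```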
Setup
To try this yourself, clone the spiffe-example repository. We use Terraform to automatically create the requisite infrastructure, then use one of these authentication methods to provide AWS credentials. Running `make up` from within the cadfael directory invokes Terraform’s apply command, which sets our infrastructure’s state as defined by the aws.tf terraform plan in the repo’s ec2 directory. This creates:
- One Amazon VPC isolating the example Amazon EC2 instances, allowing the Envoy proxies to be discoverable via their private IP addresses. The VPC’s default CIDR block is set to 10.70.0.0/24.
- Three Amazon EC2 instances with the following static private IP addresses:
* 10.70.0.10 (for the blog)
* 10.70.0.20 (for the database)
* 10.70.0.30 (for the SPIRE Server)
Each Amazon EC2 instance is pre-installed with the relevant software components. To ease identification, the instances are also labelled with the prefix spire, and with one of the suffixes server, blog, or database.
Configuring SPIRE
A SPIRE Server is bound to a trust domain, which represents workloads that implicitly trust each other’s identity documents. I’ll use example.org here.
For workloads within a trust domain, SPIRE uses an attestation policy to decide which identity to issue to a requesting workload. Such policies are typically described in terms of 1) the infrastructure hosting the workload (node attestation); and 2) OS process ‘hooks’ that identify the workload running upon it (process attestation).
For this example, SPIRE will identify workloads via the following mechanisms:
- Node: The Amazon EC2 instance ID the workload is running on
- Process: The Unix user ID the SPIFFE Sidecar is running as (UID 1000)
SPIRE supports flexible attestation policies through plugins. The Unix workload attestor plugin packaged with SPIRE allows us to define policies based on Unix user IDs. To identify specific Amazon EC2 instances, we use the aws-iid-attestor (which must currently be installed separately).
On the SPIRE Server, we register the workloads with the following commands (substituting the AWS account ID and Amazon EC2 instance ID of the instance running each workload, as appropriate):
$ spire-server register \
    -parentID spiffe://example.org/spire/agent/aws_iid_attestor/<aws_account_id>/<database_ec2_instance_id> \
    -spiffeID spiffe://example.org/Database \
    -selector unix:uid:1000

$ spire-server register \
    -parentID spiffe://example.org/spire/agent/aws_iid_attestor/<aws_account_id>/<blog_ec2_instance_id> \
    -spiffeID spiffe://example.org/Blog \
    -selector unix:uid:1000
Configuring the Sidecar
The SPIFFE Sidecar must be configured to read new credentials from the Workload API, place them on disk so Envoy can read them, and signal to Envoy that new credentials are available. The Sidecars running on the blog and database Amazon EC2 instances are configured identically.
/opt/sidecar/bin/sidecar_config.hcl on the spire-blog and spire-database instances:
agentAddress = "/tmp/agent.sock"        # Where to find the Workload API
                                        # exposed by the agent
cmd = "./hot-restarter.py"              # Command to call when a new
                                        # certificate is retrieved
cmdArgs = "start_envoy.sh"
certDir = "/certs"                      # Directory the Sidecar writes
                                        # the certs to
renewSignal = "SIGHUP"
svidFileName = "svid.pem"               # Filename to use for the
                                        # workload's SVID
svidKeyFileName = "svid_key.pem"        # Filename to use for the
                                        # workload's private key
svidBundleFileName = "svid_bundle.pem"  # Filename to use for the bundle
                                        # of certificates used to verify
                                        # other workloads
The Sidecar writes into the /certs directory the private key Envoy uses to encrypt egress connections, and the certificates needed to verify ingress connections. It then calls `hot-restarter.py` (a Python script provided by Envoy) with `start_envoy.sh` as an argument. Envoy provides this hook to hot restart, triggering it to load the newer svidFileName, svidKeyFileName, and svidBundleFileName files without dropping existing connections. The Sidecar monitors the retrieved certificates’ expiry and repeats this process once the SVID is halfway through its lifetime.
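The halfway-point renewal can be sketched with simple datetime arithmetic. This is a simplified model of the behavior described above, not the Sidecar’s actual code:

```python
from datetime import datetime, timedelta

def next_renewal(not_before: datetime, not_after: datetime) -> datetime:
    """Renew once half of the certificate's lifetime has elapsed."""
    return not_before + (not_after - not_before) / 2

# A one-hour SVID issued at noon is refreshed at 12:30.
issued = datetime(2018, 1, 1, 12, 0, 0)
expires = issued + timedelta(hours=1)
print(next_renewal(issued, expires))  # 2018-01-01 12:30:00
```

Renewing at the midpoint leaves a generous window to retry against the Workload API before the old credentials actually expire.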
Configuring Envoy & the workloads
The Envoy proxy’s listener and cluster ssl_context are configured to point to the credentials retrieved by the Sidecar. Here is an excerpt of ssl_context from envoy.json, configured to load the certificate, private key, and CA certificate bundle. Observe that these file paths match the Sidecar configuration:
A fragment of /opt/sidecar/bin/envoy.json on the spire-blog and spire-database instances:
…
"ssl_context": {
  "cert_chain_file": "/certs/svid.pem",
  "private_key_file": "/certs/svid_key.pem",
  "ca_cert_file": "/certs/svid_bundle.pem",
  …
}
…
The proxy on the spire-blog instance is configured to establish a connection with the database workload. The blog’s SQLAlchemy configuration is pointed to local port 8003, where Envoy is listening and proxies traffic to the database instance.
On the spire-blog instance, Envoy’s ssl_context is configured to load the blog’s SVID certificate with SPIFFE ID spiffe://example.org/Blog. The database’s SPIFFE ID is configured in the verify_subject_alt_name attribute.
/opt/sidecar/bin/envoy.json on the spire-blog instance
{
  "listeners": [{
    "address": "tcp://0.0.0.0:8003",
    "filters": [{
      "type": "read",
      "name": "tcp_proxy",
      "config": {
        "stat_prefix": "blog",
        "route_config": {
          "routes": [{
            "cluster": "database"
          }]
        }
      }
    }]
  }],
  "admin": {
    "access_log_path": "/tmp/admin_access.log",
    "address": "tcp://0.0.0.0:9901"
  },
  "cluster_manager": {
    "clusters": [{
      "name": "database",
      "connect_timeout_ms": 250,
      "type": "strict_dns",
      "lb_type": "round_robin",
      "hosts": [{
        "url": "tcp://10.70.0.20:8002"
      }],
      "ssl_context": {
        "cert_chain_file": "/certs/svid.pem",
        "private_key_file": "/certs/svid_key.pem",
        "ca_cert_file": "/certs/svid_bundle.pem",
        "ecdh_curves": "X25519:P-256:P-521:P-384",
        "verify_subject_alt_name": ["spiffe://example.org/Database"]
      }
    }]
  }
}
The Envoy listener on the database instance is configured to listen on port 8002 and forward ingress requests to the database process running on localhost port 3306. Here, the listener handles the TLS handshake, presents the database’s SVID certificate, and verifies that the SAN field of the connecting certificate matches the SPIFFE ID spiffe://example.org/Blog.
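Conceptually, the verify_subject_alt_name check boils down to comparing the URI SAN carried in the peer’s SVID against an expected SPIFFE ID. A sketch of the idea, not Envoy’s implementation:

```python
def authorize_peer(peer_uri_sans, expected_spiffe_id):
    """Accept the connection only if the peer's certificate carries
    the expected SPIFFE ID in one of its URI SAN entries."""
    return expected_spiffe_id in peer_uri_sans

# The database-side listener only accepts the blog's identity.
print(authorize_peer(["spiffe://example.org/Blog"],
                     "spiffe://example.org/Blog"))     # True
print(authorize_peer(["spiffe://example.org/Mallory"],
                     "spiffe://example.org/Blog"))     # False
```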
/opt/sidecar/bin/envoy.json on the spire-database instance
{
  "listeners": [{
    "address": "tcp://0.0.0.0:8002",
    "filters": [{
      "type": "read",
      "name": "tcp_proxy",
      "config": {
        "stat_prefix": "database",
        "route_config": {
          "routes": [{
            "cluster": "blog"
          }]
        }
      }
    }],
    "ssl_context": {
      "cert_chain_file": "/certs/svid.pem",
      "private_key_file": "/certs/svid_key.pem",
      "ca_cert_file": "/certs/svid_bundle.pem",
      "ecdh_curves": "X25519:P-256:P-521:P-384",
      "verify_subject_alt_name": ["spiffe://example.org/Blog"
      ]
    }
  }],
  "admin": {
    "access_log_path": "/tmp/admin_access.log",
    "address": "tcp://0.0.0.0:9901"
  },
  "cluster_manager": {
    "clusters": [{
      "name": "blog",
      "connect_timeout_ms": 250,
      "type": "strict_dns",
      "lb_type": "round_robin",
      "hosts": [{
        "url": "tcp://127.0.0.1:3306"
      }]
    }]
  }
}
Hopefully, this post gave you a taste of how you can use SPIRE to deliver robust PKI in a real-world setting. The SPIFFE community is working hard every day to improve these projects to make such integrations even easier in the future. If you’d like to get involved, come join us!