"It's a Mess, a Service Mesh!"

The World Before Service Mesh, The Problem We Are Solving

In June of 2014 Kubernetes was pushed to GitHub - a whopping commit that would eventually change the technical landscape to what it is today. This came years, and some would say decades, after the birth of the "microservices" push, which aimed to build applications in the style of the Unix philosophy: "Write programs that do one thing and do it well."

Before Kubernetes and Docker became widely popular, microservices architecture already existed. It was difficult to maintain, though, and monolithic architecture put less burden on operations teams. Monoliths, on the other hand, were cumbersome for software engineers to add features to, update, or fix bugs within. The "new" usage of Linux containers allowed smaller units of software to be deployed easily, without bumping into what were then current, but are now legacy, deployments. The ability to run both the future and the past together aided in the widespread adoption of containers.

History of Service Mesh

Elsewhere in the 2010s (we can say 2014 for consistency's sake), the tech giants were at work. Their global platforms were starting to crumble under the load of their users. Their "streamlined" approach consisted mainly of three tiers (front end, application, and backend or state store). The front end was responsive, but communication between the application and the state store would lag, leading to dropped requests or eventually toppling the state store. The giants moved their focus to creating more distributed, isolated, and independent layers, allowing for enhanced reliability. This came at a cost though: every one of these layers needed to be deployed with a proxy in order to communicate with other services. These layers would later be called "microservices".

However, with these smaller units of software being deployed, how these applications talked to one another became a networking nightmare. As the "move fast, break things" idea spread, it became more and more difficult to ensure that applications were connected to the correct backend, that traffic was monitored and logged, and that latency was recorded. The world wanted seamless blue-green deployments, and with them the ability to swap traffic flow between applications with zero downtime.
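To make the blue-green idea concrete, here is a minimal sketch of weighted traffic selection, the kind of decision a proxy makes for every request. It is an illustration only and not taken from any particular mesh: `pick_backend`, the `blue` and `green` names, and the weight values are all assumptions made up for this example.

```python
import random
from typing import Dict, Optional


def pick_backend(weights: Dict[str, int], roll: Optional[float] = None) -> str:
    """Pick a backend name in proportion to its weight (hypothetical helper)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one backend needs a positive weight")
    # roll is a number in [0, 1); it can be injected so the choice is testable.
    point = (random.random() if roll is None else roll) * total
    cumulative = 0
    for name, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return name
    return name  # only reached if roll is 1.0 or larger


# Blue-green: all traffic goes to one version, then the weights are flipped.
blue_live = {"blue": 100, "green": 0}
green_live = {"blue": 0, "green": 100}
print(pick_backend(blue_live))   # always "blue"
print(pick_backend(green_live))  # always "green"
```

Swapping `blue_live` for `green_live` is the zero-downtime cutover: no application is redeployed, only the routing weights change. The same function with weights like 90 and 10 would describe a gradual canary rollout instead of a hard switch.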

We switched from chonky software that was predictable and hardcoded to GLP-1-esque microservices, which required the entire system to cater to their ephemeral nature.

Our technical pioneers came from places like Bell Labs and IBM, but I would dare to say our "modern day" pioneers come from Netflix, Google, and Twitter. As Twitter was switching to a microservices architecture, it built Finagle, a Scala-based RPC system. Finagle handled retry logic, load balancing, and reliability logic between the services.
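As a rough picture of what "retry logic and load balancing" means in practice, here is a small sketch that rotates through backends and retries on connection failures. This is not Finagle's API (Finagle is a Scala library); `call_with_retries`, `flaky_send`, and the addresses are hypothetical names invented for this example.

```python
import itertools
from typing import Callable, Sequence


def call_with_retries(
    backends: Sequence[str],
    send: Callable[[str], str],
    max_attempts: int = 3,
) -> str:
    """Try backends in round-robin order until one answers (hypothetical helper)."""
    if not backends or max_attempts < 1:
        raise ValueError("need at least one backend and one attempt")
    rotation = itertools.cycle(backends)  # round-robin load balancing
    last_error = None
    for _ in range(max_attempts):
        backend = next(rotation)
        try:
            return send(backend)
        except ConnectionError as err:
            last_error = err  # remember the failure, move on to the next backend
    raise last_error


def flaky_send(backend: str) -> str:
    """Stand-in for a real network call: the first address is always down."""
    if backend == "10.0.0.1:8080":
        raise ConnectionError(f"{backend} is down")
    return f"ok from {backend}"


print(call_with_retries(["10.0.0.1:8080", "10.0.0.2:8080"], flaky_send))
# prints "ok from 10.0.0.2:8080"
```

The point of pushing this into a shared layer is that every service gets the same behavior without each team rewriting it. A production version would also want backoff between attempts and a way to avoid retrying requests that are not safe to repeat.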

In 2015, two Twitter engineers left and founded a company called Buoyant, which created Linkerd, the first service mesh. It focused on connections between microservices, prioritizing security and reliability. The term "service mesh" was coined to aid in marketing.

My home in Italy has concrete walls - coming from American housing, I am used to houses built fast and made of cardboard. The concrete in my house mixed with the wood fills me with confidence that whatever weather comes running off the mountains, this house will withstand it. However, it proves difficult for my Ubiquiti router to have a good connection to an entire side of my house - first world problems, I know. We bought an access point, which works as a mesh network. I will assume that mesh networks built from access points like this one helped inspire the service mesh idea.

Current State

Nowadays we have a few main contenders in the service mesh category.

Let's start with the OG, Linkerd.

Linkerd markets itself as a minimal, lightweight service mesh. The data plane is written in Rust, which we can assume was a choice to enhance speed. The control plane is written in Go, and Linkerd is deployed via Helm or the command line and configured through a series of CRDs in your cluster. Linkerd is focused on Kubernetes only, and has some of the lowest latency. It only runs as a sidecar. Linkerd prides itself on being easy to maintain, a "deploy it and forget it" mesh, allowing you to adopt mTLS throughout the cluster without much overhead.

Consul will always have a soft spot in my heart, being the first piece of software I worked on. Before I joined the team, I watched Paul Banks's talk on Consul Connect about three or four times to try and grok it - this was before I could even write "Hello World".

Consul Connect is great if you have already adopted the HashiCorp tooling ecosystem and you're running Consul. It supports hybrid VM and Kubernetes communication, so you are not locked into Kubernetes only. It is written entirely in Go, which could be seen as a benefit or a hindrance. Consul is distributed by nature, relying on Raft, the same consensus algorithm behind etcd, the state store used in Kubernetes. Consul also uses a gossip protocol, which allows it to connect easily across multiple datacenters and regions. Its benchmarks show that its architecture makes it a leader in enterprise system communication. However, because Consul Connect is focused on the HashiCorp ecosystem, it lacks some of the features Kubernetes users would expect, such as Gateway API support.

Istio was released in 2017 as a joint project between Google, IBM, and Lyft. It is built upon Lyft's Envoy proxy, which is written in C++ with some Rust components (WASM-based filters and dynamic modules), and it also leverages ztunnel, which is written in Rust. It has the largest feature set of any of the service meshes listed, but that comes at a cost: it can be difficult to manage and maintain. Istio lets you choose between sidecar and ambient mode. Sidecar is the traditional proxy that runs alongside your service, while ambient mode is the more "lightweight" option that routes traffic through a node proxy.

Cilium was released in 2015 as a Container Network Interface (CNI) project. Like Consul, Cilium added its service mesh feature later on. It's written in Go and leverages eBPF, a Linux kernel technology for running sandboxed programs inside the kernel. This makes Cilium completely "sidecar free", and it is one of the best meshes for CPU consumption. It is incredibly performant, but as mentioned before, performance has tradeoffs, and here the tradeoff falls on the end user, who needs expertise in Day 2 operations.

Meshing It All Together

There are other meshes out there that are less popular or retired (RIP Open Service Mesh). With more and more services being deployed, being able to assign them addresses and set rules on communication between them will always be a problem that needs solving. If you're about to add a mesh though, I think it is always a good time to really look at what you currently need and adopt what suits that case best. Software seems like a popularity contest, but as a die-hard Nomad lover, I think if we chose what was best for our needs instead of what was cool, then we wouldn't have KubeCon at 10k+ attendees.
