Hi! This is a blog post sharing some low-level Linux networking we're doing at Modal with WireGuard.
As a serverless platform, we hit a bit of a tricky tradeoff: we run multi-tenant user workloads on machines around the world, and each serverless function is an autoscaling container pool. How do you let users give their functions static IPs while keeping those IPs decoupled from the underlying compute, so it stays flexible?
We needed a high-availability VPN proxy for containers and didn't find one, so we built our own on top of WireGuard and open-sourced it at https://github.com/modal-labs/vprox
Let us know if you have thoughts! I'm relatively new to low-level container networking, and we (me + my coworkers Luis and Jeffrey + others) have enjoyed working on this.
Neat. I am curious what notable differences there are between Modal and Tailscale.
Thanks. We did check out Tailscale, but they didn't quite have what we were looking for: some high-availability custom component that plugs into a low-level container runtime. (Which makes sense, it's pretty different from their intended use case.)
Modal is actually a happy customer of Tailscale (but for other purposes). :D
So if a company only needs an outbound VPN for their road warriors and not an inbound VPN to access internal servers, vprox could be a simpler alternative to Tailscale?
You're using containers as a multi-tenancy boundary for arbitrary code?
We use gVisor! It's an open-source application security sandbox spun off from Google. We work with the gVisor team to get the features we need (notably GPUs / CUDA support) and also help test gVisor upstream https://gvisor.dev/users/
It's also used by Google Kubernetes Engine, OpenAI, and Cloudflare among others to run untrusted code.
Are these the facts?
- You are using a container orchestrator like Kubernetes
- You are using gVisor as a container runtime
- Two applications from different users, containerized, are scheduled on the same node.
Then, which of the following are true?
(1) Both have shared access to an NVIDIA GPU
(2) Both share access to the NVIDIA GPU via CUDA MPS
(3) If there were 2 or more MIGs on the node with a MIG-supporting GPU, the NVIDIA container toolkit shim assigned a distinct MIG to each application
We don't use Kubernetes to run user workloads, but we do use gVisor. We don't use MIG (multi-instance GPU) or MPS. If you run a container on Modal using N GPUs, you get the entire N GPUs.
If you'd like to learn more, you can check out our docs here: https://modal.com/docs/guide/gpu
Re not using Kubernetes, we have our own custom container runtime in Rust with optimizations like lazy loading of content-addressed file systems. https://www.youtube.com/watch?v=SlkEW4C2kd4
If the nvidia driver has a bug, can one workload access data of another running on the physical machine?
E.g. it came up in this thread: https://news.ycombinator.com/item?id=41672168
Yes. The kernel has access to data from every workload, and so technically a bug in _anything_ running at kernel level could result in data leakage.
Suppose I ask for two H100s. Will I have GPU P2P capabilities?
Yep! This is something we have internal tests for, haha; you have good instincts that it can be tricky. Here's an example of using that for multi-GPU training https://modal.com/docs/examples/llm-finetuning
Okay, well think very deeply about what you are saying about isolation; the topology of the hardware; and why NVIDIA does not allow P2P access even in vGPU settings except in specific circumstances that are not yours. I think if it were as easy to make the isolation promises you are making, NVIDIA would already do it. Malformed NVLink messages make GPUs fall off the bus even in trusted applications.
Yes it will.
(I work at Modal.)
As a predominantly Rust shop, why choose Go for this?
WireGuard's premier userspace implementation is in Go.
Premier = official? Do you know how it compares to this one by Cloudflare? https://github.com/cloudflare/boringtun
Sorry, just realized the misunderstanding. To clarify: Modal still uses the kernel WireGuard module. The userspace piece that's in Go (and doesn't have an equivalent in the other languages we use) is the wgctrl library.
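For context, driving the kernel module from Go with wgctrl looks roughly like this. This is a minimal sketch with a placeholder device name, keys, and allowed IPs, not vprox's actual code:

```go
// Minimal sketch: configuring a kernel WireGuard device from Go via wgctrl.
// The device name, keys, and allowed IPs are placeholders.
package main

import (
	"log"
	"net"
	"time"

	"golang.zx2c4.com/wireguard/wgctrl"
	"golang.zx2c4.com/wireguard/wgctrl/wgtypes"
)

func main() {
	client, err := wgctrl.New()
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	serverKey, err := wgtypes.GeneratePrivateKey()
	if err != nil {
		log.Fatal(err)
	}
	// In practice this would be the connecting client's public key.
	peerPriv, _ := wgtypes.GeneratePrivateKey()

	port := 51820
	keepalive := 25 * time.Second
	_, allowed, _ := net.ParseCIDR("10.0.0.2/32")

	// Assumes a "wg0" interface already exists (created via netlink or `ip link add`).
	err = client.ConfigureDevice("wg0", wgtypes.Config{
		PrivateKey:   &serverKey,
		ListenPort:   &port,
		ReplacePeers: true,
		Peers: []wgtypes.PeerConfig{{
			PublicKey:                   peerPriv.PublicKey(),
			AllowedIPs:                  []net.IPNet{*allowed},
			PersistentKeepaliveInterval: &keepalive,
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

The actual packet handling stays in the kernel; wgctrl just talks to the module over netlink to set keys and peers.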
This is it. I like Rust a lot, but you gotta pick the right tool for the job sometimes
This is a really neat writeup! The design choice to have each "exit node" manage its local WireGuard connections instead of relying on a global control plane is clever.
An unfinished project I worked on (https://github.com/redpwn/rvpn) was a bit more ambitious, with a global control plane, and I quickly learned that supporting multiple clients, especially for anything networking-related, is a tarpit. The focus here on Linux / AWS specifically, and the results achievable from that, are nice to see.
Networking is challenging, and this was a nice deep dive into some networking internals. Thanks for sharing the details :)
Thanks for sharing. I'm interested in seeing what a global control plane might look like; seems like authentication might be tricky to get right!
Controlling our worker environment (like the `net.ipv4.conf.all.rp_filter` sysctl) is a big help for us, since it means we don't have to deal with the full range of possible network configurations.
Thanks for sharing. This new feature is neat! It might sound a bit out there, but here's a thought: could you enable assigning unique IP addresses to different serverless instances? For certain use cases, like web scraping, it's helpful to simulate requests coming from multiple locations instead of just one. I think allowing requests to originate from a pool of IP addresses would be doable given this proxy model.
So much work seems to go into working around the limitations of IPv4 instead of towards a fully IPv6 capable world.
Unfortunately we gotta do both. Overlay networks like WireGuard might be a good stepping stone to move software towards IPv6 anyway.
Static IPs for allowlists need to die already. It's 2024, come on; surely we can do better than this.
What would you suggest as an alternative?
A more modern, zero-trust solution like mTLS authentication.
That makes sense, mTLS is great. Some services like Google Cloud SQL are really good about support for it. https://cloud.google.com/sql/docs/mysql/configure-ssl-instan...
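On the client side, mTLS mostly comes down to presenting a client certificate and trusting the server's CA. Here's a rough sketch with go-sql-driver/mysql, assuming a MySQL endpoint (like Cloud SQL) that enforces client certificates; the file paths and DSN are placeholders:

```go
// Sketch: connecting to a MySQL server with mutual TLS.
// Certificate paths and the DSN are placeholders.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"database/sql"
	"log"
	"os"

	"github.com/go-sql-driver/mysql"
)

func main() {
	// CA that signed the server's certificate.
	caPEM, err := os.ReadFile("server-ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		log.Fatal("failed to parse server CA")
	}

	// Client certificate + key issued by the server's CA (the "m" in mTLS).
	clientCert, err := tls.LoadX509KeyPair("client-cert.pem", "client-key.pem")
	if err != nil {
		log.Fatal(err)
	}

	if err := mysql.RegisterTLSConfig("custom", &tls.Config{
		RootCAs:      pool,
		Certificates: []tls.Certificate{clientCert},
	}); err != nil {
		log.Fatal(err)
	}

	db, err := sql.Open("mysql", "appuser:password@tcp(10.1.2.3:3306)/mydb?tls=custom")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	log.Println(db.Ping())
}
```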
It's not quite a zero-trust solution though due to the CA chain of trust.
mTLS is security at a different layer than IP source whitelisting, though. I'd say that a lot of companies we spoke to would want both as a defense-in-depth measure. Even with mTLS, network whitelisting is relevant: if your certificate were exposed, for instance, an attacker would still need to be able to forge a source IP address to start a connection.
If mTLS is combined with outbound connections, then IP source whitelisting is irrelevant; the external network cannot connect to your resources.
This (and more) is exactly what we built (I work on it) with open-source OpenZiti, a zero-trust networking platform. Bonus points: it includes SDKs, so you can embed ZTN into the serverless function itself; a colleague demonstrated it with a Python workload on AWS - https://blog.openziti.io/my-intern-assignment-call-a-dark-we....
I'd put it in the zero-trust category if the server (or owner of the server, etc) is the issuer of the client certificate and the client uses that certificate to authenticate itself, but I'll admit this is a pedantic point that adds nothing of substance. The idea being that you trust your issuance of the certificate and the various things that can be asserted based on how it was issued (stored in TPM, etc), rather than any parameter that could be controlled by the remote party.
JWT/OIDC, where the thing you're authenticating to (like MongoDB Atlas) trusts your identity provider (AWS, GCP, Modal, GitLab CI). It's better than mTLS because it allows for more flexibility in claims (extra metadata and security checks can be done with arbitrary data provided by the identity provider), and JWTs are usually shorter lived than certificates.
We have a native OIDC integration at Modal, as well! Every container gets a token. https://modal.com/docs/guide/oidc-integration
Awesome, great for you. OIDC/JWT for cross-stuff auth should become the norm.
How do you get a driver to use that, exactly?
A DB connection driver? You pass the JWT as the username/password; the token contains information about your identity and is signed by the identity provider that the party you're authenticating to has been configured to trust.
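As a rough illustration, assuming a database (or a proxy in front of it) that has been configured to validate OIDC tokens presented as the password; the env var, host, user, and driver choice are placeholders:

```go
// Sketch: presenting a workload JWT as the database password.
// Assumes the server validates OIDC tokens offered as passwords.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"net/url"
	"os"

	_ "github.com/lib/pq"
)

func main() {
	// Token issued by the workload's identity provider (cloud, CI, platform).
	token := os.Getenv("WORKLOAD_JWT")

	dsn := fmt.Sprintf("postgres://app:%s@db.internal:5432/mydb?sslmode=verify-full",
		url.QueryEscape(token))

	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The server verifies the token's signature against the identity
	// provider's keys and checks its claims before allowing the session.
	log.Println(db.Ping())
}
```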
Or you use a broker like Vault: you authenticate to it with that JWT, and it generates a just-in-time, ephemeral username/password for your database, which gets rotated at some point.
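A minimal sketch of that broker flow with the Vault Go client, assuming a JWT auth role and a database secrets engine are already configured; the env var, role name, and secret path are placeholders:

```go
// Sketch: exchange a workload JWT for a Vault token, then read
// short-lived database credentials from the database secrets engine.
package main

import (
	"log"
	"os"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig()) // uses VAULT_ADDR
	if err != nil {
		log.Fatal(err)
	}

	// 1. Authenticate with the identity provider's JWT.
	auth, err := client.Logical().Write("auth/jwt/login", map[string]interface{}{
		"role": "my-app",
		"jwt":  os.Getenv("WORKLOAD_JWT"),
	})
	if err != nil {
		log.Fatal(err)
	}
	client.SetToken(auth.Auth.ClientToken)

	// 2. Ask for a just-in-time username/password with a lease.
	creds, err := client.Logical().Read("database/creds/my-app")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("ephemeral user %v (lease %ds)", creds.Data["username"], creds.LeaseDuration)
}
```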
Completely agree. IP addresses are almost never a good means of authentication, and relying on them results in brittle, inflexible architecture: applications become aware of layers they should be abstracted from.
Firewalls exist; many network environments block everything not explicitly allowed.
Authentication is only part of the problem; networks are firewalled (with dedicated appliances) and segmented to prevent lateral movement in the event of a compromise.
Isn’t that completely orthogonal? IP addresses aren’t authenticated, they can be spoofed
It's not authentication. People aren't using static IPs for authentication purposes.
But if I have firewall policies that allow connections only to specific services, I need a destination address and port (yes, some firewalls allow host names, but there are drawbacks to that).
> IP addresses aren't authenticated, they can be spoofed
For anything bidirectional, you'd need the client to have a route back to you for that address, which would require compromising some routers and advertising the address via BGP, etc.
You can spoof addresses all you want, but it generally won't do much for a stateful protocol.
> People aren't using static IPs for authentication purposes
Unfortunately they are! I’ve seen IP whitelisting used as the only means of authentication over the WAN several times.
> People aren't using static IPs for authentication purposes
Lol. Of course they do. In fact, it's the only viable way to authenticate servers in Current Year. Unlike SSH host keys, which literally nobody on this planet takes seriously, or HTTPS certificates, which are just make-work security theater.
Now this is an interesting take; I can’t tell if you are being serious.
> Modal has an isolated container runtime that lets us share each host’s CPU and memory between workloads.
Looks like Modal hosts workloads in containers, not VMs. How do you enforce secure isolation with this design? A single kernel vulnerability could lead to remote execution on the host, impacting all workloads. Am I missing anything?
I mentioned this in another comment thread, but we use gVisor to enforce isolation. https://gvisor.dev/users/
It's also used by Google Kubernetes Engine, OpenAI, and Cloudflare among others to run untrusted code.
And Google's own serverless offerings (App Engine, Cloud Run, Cloud Functions) :-)
Disclaimer: I'm an SRE on the GCP Serverless products.
Neat, thanks for sharing! Glad to know we're in good company here.
Why is it important to have a static outbound ip address?
side question: what do you use to make the diagrams?
This is just what I needed. Chef's kiss.
Do you block certain ports?
It's going to take years for orgs to adopt IPv6 and mTLS+JWT/OIDC.
Even longer for QUIC/H3.
I’m not convinced that mTLS or OIDC are good ideas.
... Are you going to say why?
I could go into detail, but it was stated as if these technologies should unambiguously be adopted.
I am very curious, because I do think they are unambiguously a good thing.
I guess my first question is, why is this built on IPv4 rather than IPv6...
Yeah, great question. This came up at the beginning of design. A lot of our customers specifically needed IPv4 whitelisting. For example, MongoDB Atlas (a very popular database vendor) only supports IPv4. https://www.mongodb.com/community/forums/t/does-mongodb-atla...
The architecture of vprox is pretty generic though and could support IPv6 as well.
I guess that works until other customers need access to IPv6-only resources… (e.g.: we've stopped rolling IPv4 to any of our CI. No IPv6, no build artifacts…)
In a perfect world I'd also be asking whether you considered NAT64, but unfortunately I'm well aware that's a giant world of pain to get to work on Linux (involving either out-of-tree Jool, or full-on VPP)
Yeah, you hit the nail on the head. We considered NAT64 as well and looked at some implementations including eBPF-based ones like Cilium.
Glad to know that IPv6-only is working well for you. "In a perfect world…" :)
It is what it is :/ … I do periodically ask these questions to track how v4-vs-v6 things are developing, and they're moving, albeit at a snail's pace.
(FTR, it works for us because our CI is relatively self-contained. And we have local git mirrors… f***ing github…)
At my company (Fortune 100), we've been selling a lot of our public v4 space to implement... RFC1918 space. We've re-IP'd over 50,000 systems so far to private space. We just implemented NAT for the first time ever. I was surprised to see how far behind some companies are.
Progress is coming from the weirdest corners… US DoD and NATO require IPv6 feature-parity to IPv4 nowadays, no full IPv6 = no bidding on tenders…
(I would already have expected this to be quite effective in forcing IPv6, but tbh I'm still surprised just how effective.)
Couldn't a NAT instance in front of the containers accomplish this as well (assuming it's only needed for outbound traffic)? The open source project fck-nat[1] looks amazing for this purpose.
[1] https://fck-nat.dev/stable/
Right, vprox servers act as multiplexed NAT instances with a VPN attached. You do still need the VPN part though since our containers run around the world, in multiple regions and availability zones. Setting the gateway to a machine running fck-nat would only work if that machine is in the same subnet (e.g., for AWS, in one availability zone).
The other features that were hard requirements for us were multi-tenancy and high availability / failover.
By the way, fck-nat is just a basic shell script that sets the `ip_forward` and `rp_filter` sysctls and adds an IP masquerade rule. If you look at vprox, we also do this but build a lot on top of it. https://github.com/modal-labs/vprox
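For the curious, the core of that NAT gateway setup is roughly the following. This is a simplified sketch in Go with a placeholder interface name, not vprox's actual setup code (it needs root, and it shells out to iptables for the masquerade rule):

```go
// Rough equivalent of a minimal NAT gateway setup: enable forwarding,
// loosen reverse-path filtering, and masquerade outbound traffic.
package main

import (
	"log"
	"os"
	"os/exec"
)

func sysctl(path, value string) {
	if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
		log.Fatalf("sysctl %s: %v", path, err)
	}
}

func main() {
	// Forward packets between interfaces.
	sysctl("/proc/sys/net/ipv4/ip_forward", "1")
	// Loose reverse-path filtering so asymmetric routes through the
	// tunnel aren't dropped.
	sysctl("/proc/sys/net/ipv4/conf/all/rp_filter", "2")

	// Masquerade traffic leaving the public interface so it uses this
	// machine's source address ("eth0" is a placeholder).
	cmd := exec.Command("iptables", "-t", "nat", "-A", "POSTROUTING",
		"-o", "eth0", "-j", "MASQUERADE")
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("iptables: %v: %s", err, out)
	}
}
```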
Ahh, that makes sense. I do think a single fck-nat instance can service multiple AZs in an AWS region, though; you just need to adjust the VPC routing table. Thanks for the reply and info.