Hi! This is a blog post sharing some low-level Linux networking we're doing at Modal with WireGuard.
As a serverless platform, we hit a bit of a tricky tradeoff: we run multi-tenant user workloads on machines around the world, and each serverless function is an autoscaling container pool. How do you let users give their functions static IPs while keeping those IPs decoupled from the underlying compute, so it stays flexible?
We needed a high-availability VPN proxy for containers and didn't find one, so we built our own on top of WireGuard and open-sourced it at https://github.com/modal-labs/vprox
Let us know if you have thoughts! I'm relatively new to low-level container networking, and we (me + my coworkers Luis and Jeffrey + others) have enjoyed working on this.
Neat. I am curious what notable differences there are between Modal and Tailscale.
Thanks. We did check out Tailscale, but they didn't quite have what we were looking for: some high-availability custom component that plugs into a low-level container runtime. (Which makes sense, it's pretty different from their intended use case.)
Modal is actually a happy customer of Tailscale (but for other purposes). :D
So if a company only needs an outbound VPN for their road warriors and not an inbound VPN to access internal servers, vprox could be a simpler alternative to Tailscale?
You're using containers as a multi-tenancy boundary for arbitrary code?
We use gVisor! It's an open-source application security sandbox spun off from Google. We work with the gVisor team to get the features we need (notably GPUs / CUDA support) and also help test gVisor upstream https://gvisor.dev/users/
It's also used by Google Kubernetes Engine, OpenAI, and Cloudflare among others to run untrusted code.
Are these the facts?
- You are using a container orchestrator like Kubernetes
- You are using gVisor as a container runtime
- Two applications from different users, containerized, are scheduled on the same node.
Then, which of the following are true?
(1) Both have shared access to an NVIDIA GPU
(2) Both share access to the NVIDIA GPU via CUDA MPS
(3) If there were 2 or more MIGs on the node with a MIG-supporting GPU, the NVIDIA container toolkit shim assigned a distinct MIG to each application
We don't use Kubernetes to run user workloads, but we do use gVisor. We don't use MIG (multi-instance GPU) or MPS. If you run a container on Modal using N GPUs, you get the entire N GPUs.
If you'd like to learn more, you can check out our docs here: https://modal.com/docs/guide/gpu
Re not using Kubernetes, we have our own custom container runtime in Rust with optimizations like lazy loading of content-addressed file systems. https://www.youtube.com/watch?v=SlkEW4C2kd4
If the nvidia driver has a bug, can one workload access data of another running on the physical machine?
E.g. it came up in this thread: https://news.ycombinator.com/item?id=41672168
Yes. The kernel has access to data from every workload, and so technically a bug in _anything_ running at kernel level could result in data leakage.
Suppose I ask for two H100s. Will I have GPU P2P capabilities?
Yep! This is something we have internal tests for, haha; you have good instincts that it can be tricky. Here's an example of using that for multi-GPU training https://modal.com/docs/examples/llm-finetuning
Okay, well think very deeply about what you are saying about isolation; the topology of the hardware; and why NVIDIA does not allow P2P access even in vGPU settings except in specific circumstances that are not yours. I think if it were as easy to make the isolation promises you are making, NVIDIA would already do it. Malformed NVLink messages make GPUs fall off the bus even in trusted applications.
Yes it will.
(I work at Modal.)
As a predominantly Rust shop, why choose Go for this?
WireGuard's premier userspace implementation is in Go.
Premier = official? Do you know how it compares to this one by Cloudflare? https://github.com/cloudflare/boringtun
Sorry, just realized the misunderstanding. To clarify: Modal still uses the kernel WireGuard module. The userspace piece that's in Go (and doesn't have an equivalent in the other languages we use) is the wgctrl library.
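For context, driving the kernel module from Go with wgctrl looks roughly like this. This is a minimal sketch with a placeholder device name, keys, and allowed IPs, not vprox's actual code:

```go
// Minimal sketch: configuring a kernel WireGuard device from Go via wgctrl.
// The device name, keys, and allowed IPs are placeholders.
package main

import (
	"log"
	"net"
	"time"

	"golang.zx2c4.com/wireguard/wgctrl"
	"golang.zx2c4.com/wireguard/wgctrl/wgtypes"
)

func main() {
	client, err := wgctrl.New()
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	serverKey, err := wgtypes.GeneratePrivateKey()
	if err != nil {
		log.Fatal(err)
	}
	// In practice this would be the connecting client's public key.
	peerPriv, _ := wgtypes.GeneratePrivateKey()

	port := 51820
	keepalive := 25 * time.Second
	_, allowed, _ := net.ParseCIDR("10.0.0.2/32")

	// Assumes a "wg0" interface already exists (created via netlink or `ip link add`).
	err = client.ConfigureDevice("wg0", wgtypes.Config{
		PrivateKey:   &serverKey,
		ListenPort:   &port,
		ReplacePeers: true,
		Peers: []wgtypes.PeerConfig{{
			PublicKey:                   peerPriv.PublicKey(),
			AllowedIPs:                  []net.IPNet{*allowed},
			PersistentKeepaliveInterval: &keepalive,
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

The actual packet handling stays in the kernel; wgctrl just talks to the module over netlink to set keys and peers.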
This is it. I like Rust a lot, but you gotta pick the right tool for the job sometimes
This is a really neat writeup! The design choice to have each "exit node" manage its local WireGuard connections instead of relying on a global control plane is clever.
An unfinished project I worked on (https://github.com/redpwn/rvpn) was a bit more ambitious, with a global control plane, and I quickly learned that supporting multiple clients, especially for anything networking-related, is a tarpit. The focus here on Linux / AWS specifically, and the results achievable from that, are nice to see.
Networking is challenging, and this was a nice deep dive into some networking internals. Thanks for sharing the details :)
Thanks for sharing. I'm interested in seeing what a global control plane might look like; seems like authentication might be tricky to get right!
Controlling our worker environment (like the `net.ipv4.conf.all.rp_filter` sysctl) is a big help for us, since it means we don't have to deal with the full range of possible network configurations.
Thanks for sharing. This new feature is neat! It might sound a bit out there, but here's a thought: could you enable assigning unique IP addresses to different serverless instances? For certain use cases, like web scraping, it's helpful to simulate requests coming from multiple locations instead of just one. I think allowing requests to originate from a pool of IP addresses would be doable given this proxy model.
So much work seems to go into working around the limitations of IPv4 instead of towards a fully IPv6 capable world.
Unfortunately we gotta do both. Overlay networks like WireGuard might be a good stepping stone to move software towards IPv6 anyway.
Static IPs for allowlists need to die already. It's 2024, come on; surely we can do better than this.
What would you suggest as an alternative?
A more modern, zero-trust solution like mTLS authentication.
That makes sense, mTLS is great. Some services like Google Cloud SQL are really good about support for it. https://cloud.google.com/sql/docs/mysql/configure-ssl-instan...
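On the client side, mTLS mostly comes down to presenting a client certificate and trusting the server's CA. Here's a rough sketch with go-sql-driver/mysql, assuming a MySQL endpoint (like Cloud SQL) that enforces client certificates; the file paths and DSN are placeholders:

```go
// Sketch: connecting to a MySQL server with mutual TLS.
// Certificate paths and the DSN are placeholders.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"database/sql"
	"log"
	"os"

	"github.com/go-sql-driver/mysql"
)

func main() {
	// CA that signed the server's certificate.
	caPEM, err := os.ReadFile("server-ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		log.Fatal("failed to parse server CA")
	}

	// Client certificate + key issued by the server's CA (the "m" in mTLS).
	clientCert, err := tls.LoadX509KeyPair("client-cert.pem", "client-key.pem")
	if err != nil {
		log.Fatal(err)
	}

	if err := mysql.RegisterTLSConfig("custom", &tls.Config{
		RootCAs:      pool,
		Certificates: []tls.Certificate{clientCert},
	}); err != nil {
		log.Fatal(err)
	}

	db, err := sql.Open("mysql", "appuser:password@tcp(10.1.2.3:3306)/mydb?tls=custom")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	log.Println(db.Ping())
}
```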
It's not quite a zero-trust solution though due to the CA chain of trust.
mTLS is security at a different layer than IP source whitelisting, though. I'd say that a lot of companies we spoke to would want both as a defense-in-depth measure. Even with mTLS, network whitelisting is relevant: if your certificate were exposed, for instance, an attacker would still need to be able to forge a source IP address to start a connection.
If mTLS is combined with outbound connections, then IP source whitelisting is irrelevant; the external network cannot connect to your resources.
This (and more) is exactly what we built (I work on it) with open-source OpenZiti, a zero-trust networking platform. Bonus points: it includes SDKs, so you can embed ZTN into the serverless function itself; a colleague demonstrated it with a Python workload on AWS - https://blog.openziti.io/my-intern-assignment-call-a-dark-we....
I'd put it in the zero-trust category if the server (or owner of the server, etc) is the issuer of the client certificate and the client uses that certificate to authenticate itself, but I'll admit this is a pedantic point that adds nothing of substance. The idea being that you trust your issuance of the certificate and the various things that can be asserted based on how it was issued (stored in TPM, etc), rather than any parameter that could be controlled by the remote party.
JWT/OIDC, where the thing you're authenticating to (like MongoDB Atlas) trusts your identity provider (AWS, GCP, Modal, GitLab CI). It's better than mTLS because it allows for more flexibility in claims (extra metadata and security checks can be done with arbitrary data provided by the identity provider), and JWTs are usually shorter lived than certificates.
We have a native OIDC integration at Modal, as well! Every container gets a token. https://modal.com/docs/guide/oidc-integration
Awesome, great for you. OIDC/JWT for cross-stuff auth should become the norm.
How do you get a driver to use that, exactly?
A DB connection driver? You pass the JWT as the username/password; the token contains information about your identity and is signed by the identity provider that the party you're authenticating to has been configured to trust.
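As a rough illustration, assuming a database (or a proxy in front of it) that has been configured to validate OIDC tokens presented as the password; the env var, host, user, and driver choice are placeholders:

```go
// Sketch: presenting a workload JWT as the database password.
// Assumes the server validates OIDC tokens offered as passwords.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"net/url"
	"os"

	_ "github.com/lib/pq"
)

func main() {
	// Token issued by the workload's identity provider (cloud, CI, platform).
	token := os.Getenv("WORKLOAD_JWT")

	dsn := fmt.Sprintf("postgres://app:%s@db.internal:5432/mydb?sslmode=verify-full",
		url.QueryEscape(token))

	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The server verifies the token's signature against the identity
	// provider's keys and checks its claims before allowing the session.
	log.Println(db.Ping())
}
```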
Or you use a broker like Vault: you authenticate to it with that JWT, and it generates a just-in-time, ephemeral username/password for your database, which gets rotated at some point.
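A minimal sketch of that broker flow with the Vault Go client, assuming a JWT auth role and a database secrets engine are already configured; the env var, role name, and secret path are placeholders:

```go
// Sketch: exchange a workload JWT for a Vault token, then read
// short-lived database credentials from the database secrets engine.
package main

import (
	"log"
	"os"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig()) // uses VAULT_ADDR
	if err != nil {
		log.Fatal(err)
	}

	// 1. Authenticate with the identity provider's JWT.
	auth, err := client.Logical().Write("auth/jwt/login", map[string]interface{}{
		"role": "my-app",
		"jwt":  os.Getenv("WORKLOAD_JWT"),
	})
	if err != nil {
		log.Fatal(err)
	}
	client.SetToken(auth.Auth.ClientToken)

	// 2. Ask for a just-in-time username/password with a lease.
	creds, err := client.Logical().Read("database/creds/my-app")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("ephemeral user %v (lease %ds)", creds.Data["username"], creds.LeaseDuration)
}
```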
Completely agree. IP addresses are almost never a good means of authentication, and relying on them results in brittle, inflexible architecture: applications become aware of layers they should be abstracted from.
Firewalls exist; many network environments block everything not explicitly allowed.
Authentication is only part of the problem; networks are firewalled (with dedicated appliances) and segmented to prevent lateral movement in the event of a compromise.
Isn’t that completely orthogonal? IP addresses aren’t authenticated, they can be spoofed
It's not authentication. People aren't using static IPs for authentication purposes.
But if I have firewall policies that allow connections only to specific services, I need a destination address and port (yes, some firewalls allow host names, but there are drawbacks to that).
> IP addresses aren't authenticated, they can be spoofed
For anything bidirectional, you'd need the client to have a route back to you for that address, which would require compromising some routers and advertising the address via BGP, etc.
You can spoof addresses all you want, but it generally won't do much for a stateful protocol.
> People aren't using static IPs for authentication purposes
Unfortunately they are! I’ve seen IP whitelisting used as the only means of authentication over the WAN several times.
> People aren't using static IPs for authentication purposes
Lol. Of course they do. In fact, it's the only viable way to authenticate servers in Current Year. Unlike SSH host keys, which literally nobody on this planet takes seriously, or HTTPS certificates, which are just make-work security theater.
Now this is an interesting take; I can’t tell if you are being serious.
> Modal has an isolated container runtime that lets us share each host’s CPU and memory between workloads.
Looks like Modal hosts workloads in containers, not VMs. How do you enforce secure isolation with this design? A single kernel vulnerability could lead to remote execution on the host, impacting all workloads. Am I missing anything?
I mentioned this in another comment thread, but we use gVisor to enforce isolation. https://gvisor.dev/users/
It's also used by Google Kubernetes Engine, OpenAI, and Cloudflare among others to run untrusted code.
And Google's own serverless offerings (App Engine, Cloud Run, Cloud Functions) :-)
Disclaimer: I'm an SRE on the GCP Serverless products.
Neat, thanks for sharing! Glad to know we're in good company here.
Why is it important to have a static outbound ip address?
side question: what do you use to make the diagrams?
This is just what I needed. Chef's kiss.
Do you block certain ports?
It's going to take years for orgs to adopt IPv6 and mTLS+JWT/OIDC.
Even longer for QUIC/H3.
I’m not convinced that mTLS or OIDC are good ideas.
... Are you going to say why?
I could go into detail, but it was stated as if these technologies should unambiguously be adopted.
I am very curious, because I do think they are unambiguously a good thing.
I guess my first question is, why is this built on IPv4 rather than IPv6...
Yeah, great question. This came up at the beginning of design. A lot of our customers specifically needed IPv4 whitelisting. For example, MongoDB Atlas (a very popular database vendor) only supports IPv4. https://www.mongodb.com/community/forums/t/does-mongodb-atla...
The architecture of vprox is pretty generic though and could support IPv6 as well.
I guess that works until other customers need access to IPv6-only resources… (e.g.: we've stopped rolling IPv4 to any of our CI. No IPv6, no build artifacts…)
In a perfect world I'd also be asking whether you considered NAT64, but unfortunately I'm well aware that's a giant world of pain to get to work on Linux (involving either out-of-tree Jool, or full-on VPP)
Yeah, you hit the nail on the head. We considered NAT64 as well and looked at some implementations including eBPF-based ones like Cilium.
Glad to know that IPv6-only is working well for you. "In a perfect world…" :)
It is what it is :/ … I do periodically ask these questions to track how v4-vs-v6 things are developing, and they're moving, albeit at a snail's pace.
(FTR, it works for us because our CI is relatively self-contained. And we have local git mirrors… f***ing github…)
At my company (Fortune 100), we've been selling a lot of our public v4 space to implement... RFC1918 space. We've re-IP'd over 50,000 systems so far to private space. We just implemented NAT for the first time ever. I was surprised to see how far behind some companies are.
Progress is coming from the weirdest corners… US DoD and NATO require IPv6 feature-parity to IPv4 nowadays, no full IPv6 = no bidding on tenders…
(I would already have expected this to be quite effective in forcing IPv6, but tbh I'm still surprised just how effective.)
Couldn't a NAT instance in front of the containers accomplish this as well (assuming it's only needed for outbound traffic)? The open source project fck-nat[1] looks amazing for this purpose.
[1] https://fck-nat.dev/stable/
Right, vprox servers act as multiplexed NAT instances with a VPN attached. You do still need the VPN part though since our containers run around the world, in multiple regions and availability zones. Setting the gateway to a machine running fck-nat would only work if that machine is in the same subnet (e.g., for AWS, in one availability zone).
The other features that were hard requirements for us were multi-tenancy and high availability / failover.
By the way, fck-nat is just a basic shell script that sets the `ip_forward` and `rp_filter` sysctls and adds an IP masquerade rule. If you look at vprox, we also do this but build a lot on top of it. https://github.com/modal-labs/vprox
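For the curious, the core of that NAT gateway setup is roughly the following. This is a simplified sketch in Go with a placeholder interface name, not vprox's actual setup code (it needs root, and it shells out to iptables for the masquerade rule):

```go
// Rough equivalent of a minimal NAT gateway setup: enable forwarding,
// loosen reverse-path filtering, and masquerade outbound traffic.
package main

import (
	"log"
	"os"
	"os/exec"
)

func sysctl(path, value string) {
	if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
		log.Fatalf("sysctl %s: %v", path, err)
	}
}

func main() {
	// Forward packets between interfaces.
	sysctl("/proc/sys/net/ipv4/ip_forward", "1")
	// Loose reverse-path filtering so asymmetric routes through the
	// tunnel aren't dropped.
	sysctl("/proc/sys/net/ipv4/conf/all/rp_filter", "2")

	// Masquerade traffic leaving the public interface so it uses this
	// machine's source address ("eth0" is a placeholder).
	cmd := exec.Command("iptables", "-t", "nat", "-A", "POSTROUTING",
		"-o", "eth0", "-j", "MASQUERADE")
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("iptables: %v: %s", err, out)
	}
}
```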
Ahh, that makes sense. I do think a single fck-nat instance can service multiple AZs in an AWS region, though; you just need to adjust the VPC routing table. Thanks for the reply and info.