Container Networking Deep Dive: From veth Pairs to Service Mesh
"It works on my machine but not in the cluster." Nine times out of ten, it's a networking problem. And most engineers treat container networking as a black box.
Let's open the box. We'll trace how a packet travels from one container to another — first on a single host, then across a Kubernetes cluster. Understanding this makes debugging 10x faster.
Level 0: Linux Network Namespaces
Containers don't bring their own network stack — they share the host kernel's, partitioned by Linux network namespaces. A network namespace is an isolation primitive that gives each container its own view of the network: its own interfaces, routing table, and iptables rules.
# Create a network namespace (this is what container runtimes do)
ip netns add container1
# List interfaces inside the namespace — it's empty
ip netns exec container1 ip link
# 1: lo: <LOOPBACK> ... state DOWN
A fresh namespace has only a loopback interface. No connectivity. To connect it to the outside world, we need a veth pair.
Level 1: veth Pairs — Virtual Ethernet Cables
A veth pair is a virtual Ethernet cable with two ends. Put one end in the container's namespace, the other on the host.
# Create a veth pair
ip link add veth-host type veth peer name veth-container
# Move one end into the container namespace
ip link set veth-container netns container1
# Assign IPs
ip addr add 10.0.0.1/24 dev veth-host
ip netns exec container1 ip addr add 10.0.0.2/24 dev veth-container
# Bring them up
ip link set veth-host up
ip netns exec container1 ip link set veth-container up
# Ping from host to container
ping 10.0.0.2 # Works!
This is the fundamental building block: virtually every bridged container is attached to its host this way, one veth end inside the namespace, the other on the host.
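Host-to-container ping works, but the container still can't reach anything beyond the host: it has no default route, and nothing translates its private address. A sketch of the two missing pieces, assuming the setup above and that `eth0` is the host's uplink (adjust names for your machine):

```shell
# Default route inside the namespace, via the host end of the veth pair:
ip netns exec container1 ip route add default via 10.0.0.1

# Let the host kernel forward packets between interfaces:
sysctl -w net.ipv4.ip_forward=1

# Masquerade container traffic leaving through the uplink:
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth0 -j MASQUERADE

# The container can now reach the outside world:
ip netns exec container1 ping -c 1 8.8.8.8
```

This is essentially what Docker's default bridge networking configures for you.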
Level 2: Linux Bridge — Container-to-Container on One Host
With two containers on the same host, we need a bridge — a virtual Layer 2 switch:
┌─────────────┐     ┌─────────────┐
│ Container A │     │ Container B │
│  10.0.0.2   │     │  10.0.0.3   │
│  veth-a-in  │     │  veth-b-in  │
└──────┬──────┘     └──────┬──────┘
       │                   │
  veth-a-out          veth-b-out
       │                   │
       └─────────┬─────────┘
                 │
           ┌─────┴─────┐
           │  bridge   │  (docker0, cni0, etc.)
           │ 10.0.0.1  │
           └─────┬─────┘
                 │
            eth0 (host)
# Create bridge
ip link add br0 type bridge
ip link set br0 up
ip addr add 10.0.0.1/24 dev br0
# Attach container A's veth to bridge
ip link set veth-a-out master br0
# Attach container B's veth to bridge
ip link set veth-b-out master br0
Now Container A can reach Container B via the bridge. This is exactly what Docker's docker0 bridge does.
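To confirm the bridge is actually switching frames, a quick check — a sketch assuming two namespaces (here called `ns-a` and `ns-b`, hypothetical names) wired up as above:

```shell
# Container A pings Container B through the bridge:
ip netns exec ns-a ping -c 1 10.0.0.3

# Watch the bridge learn MAC addresses as traffic flows:
bridge fdb show br br0

# If the ping fails, check whether the host is filtering bridged
# traffic through iptables (a classic gotcha):
sysctl net.bridge.bridge-nf-call-iptables
```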
Level 3: Crossing Hosts — Overlay Networks
Containers on different hosts can't reach each other through a local bridge. We need an overlay network that encapsulates container traffic inside host-to-host packets.
VXLAN
The most common overlay technology. It wraps the container's Ethernet frame inside a UDP packet:
Original packet:
[Container A: 10.244.1.5] → [Container B: 10.244.2.8]
Encapsulated:
[Host 1: 192.168.1.10] → [Host 2: 192.168.1.20]
└── UDP port 4789 (VXLAN)
└── [10.244.1.5] → [10.244.2.8] (original packet)
# Create VXLAN interface on Host 1
ip link add vxlan0 type vxlan id 42 \
local 192.168.1.10 \
dstport 4789 \
group 239.1.1.1 \
dev eth0
ip link set vxlan0 master br0
ip link set vxlan0 up
The overhead is about 50 bytes per packet (outer headers). For most workloads, negligible. For latency-sensitive workloads, it matters.
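That 50-byte overhead has a practical consequence: the overlay MTU must shrink so encapsulated packets still fit the physical link, or you get fragmentation and mysterious hangs. A minimal sketch:

```shell
# VXLAN adds ~50 bytes of outer headers (Ethernet 14 + IP 20 + UDP 8
# + VXLAN 8), so the overlay MTU must be the physical MTU minus 50:
PHYS_MTU=1500
OVERLAY_MTU=$((PHYS_MTU - 50))
echo "$OVERLAY_MTU"   # 1450

# Apply it to the VXLAN interface (requires root and the vxlan0
# device from above — and the bridge/veths behind it should match):
# ip link set vxlan0 mtu "$OVERLAY_MTU"
```

This is why pods on Flannel clusters typically show an MTU of 1450.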
Level 4: Kubernetes Networking Model
Kubernetes mandates three rules:
- Every Pod gets its own IP — no NAT between pods
- All Pods can reach all other Pods — without NAT
- The IP a Pod sees for itself is the same IP others see
How this is implemented depends on the CNI plugin:
| CNI Plugin | Approach | Performance | Features |
|---|---|---|---|
| Flannel | VXLAN overlay | Good | Simple, minimal |
| Calico | BGP routing or VXLAN | Excellent | Network policies |
| Cilium | eBPF (kernel-level) | Best | L7 policies, encryption |
| Weave | VXLAN with encryption | Good | Simple, auto-mesh |
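Not sure which CNI plugin a node is running? The kubelet reads plugin configs from a conventional directory — a quick sketch (paths may differ by distribution, and the filename shown is just an example):

```shell
# CNI configuration lives here by convention:
ls /etc/cni/net.d/

# Inspect the active config — your filename will vary:
cat /etc/cni/net.d/10-calico.conflist
```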
How a Pod-to-Pod Packet Travels (Calico BGP mode)
Pod A (10.244.1.5) on Node 1
→ veth pair → cali* interface on host
→ host routing table (learned via BGP from other nodes)
→ eth0 → physical network
→ Node 2 eth0
→ host routing table → cali* interface
→ veth pair → Pod B (10.244.2.8)
No encapsulation. No overhead. The physical network routes pod CIDRs natively via BGP. This is why we run Calico in BGP mode on our infrastructure — we already speak BGP everywhere (AS210622), so pod routing is just another set of prefixes.
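You can see this directly on a Calico node in BGP mode: pod CIDRs appear as ordinary kernel routes. A sketch — the addresses below are illustrative, and `calicoctl` must be installed separately:

```shell
# One route per remote node's pod CIDR, plus /32 routes for local pods:
ip route | grep -E '10\.244|cali'
# 10.244.2.0/24 via 192.168.1.20 dev eth0 proto bird    <- remote node
# 10.244.1.5 dev cali1a2b3c4 scope link                 <- local pod

# BGP peering health:
calicoctl node status
```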
Level 5: Services and kube-proxy
Kubernetes Services provide stable IPs (ClusterIPs) that load-balance across pods. But ClusterIPs don't exist on any interface — they're virtual IPs implemented by kube-proxy using iptables or IPVS.
iptables mode
# kube-proxy creates rules like:
iptables -t nat -A KUBE-SERVICES \
-d 10.96.0.10/32 -p tcp --dport 80 \
-j KUBE-SVC-XXXX
# KUBE-SVC-XXXX distributes to endpoints:
iptables -t nat -A KUBE-SVC-XXXX \
-m statistic --mode random --probability 0.333 \
-j KUBE-SEP-AAAA # Pod 1
iptables -t nat -A KUBE-SVC-XXXX \
-m statistic --mode random --probability 0.500 \
-j KUBE-SEP-BBBB # Pod 2
iptables -t nat -A KUBE-SVC-XXXX \
-j KUBE-SEP-CCCC # Pod 3
This works but doesn't scale. With 10,000 services, you have tens of thousands of iptables rules. Rule evaluation is O(n).
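A note on those probabilities: rules are evaluated top to bottom, so the first endpoint matches 1/3 of packets, the second matches 1/2 of the remaining 2/3, and the last catches everything else — a uniform 1/3 each. To see the scaling problem on your own nodes, a rough sketch:

```shell
# How many NAT rules has kube-proxy programmed on this node?
iptables-save -t nat > /tmp/kube-nat.rules
grep -c '^-A KUBE' /tmp/kube-nat.rules

# The other pain point at scale: full-table reloads. Time a dry run:
time iptables-restore --test /tmp/kube-nat.rules
</imports>
```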
IPVS mode
IPVS uses hash tables for service lookup — O(1) regardless of how many services you have. We run every cluster in IPVS mode:
# kube-proxy configmap
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
scheduler: "lc" # least connections
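To verify IPVS actually took over, inspect the virtual server table on a node — a sketch, with illustrative output (requires `ipvsadm` installed on the node):

```shell
# Each ClusterIP becomes a virtual server; the pods are real servers:
ipvsadm -Ln
# TCP  10.96.0.10:80 lc
#   -> 10.244.1.5:8080   Masq  1  3  0     (illustrative output)
#   -> 10.244.2.8:8080   Masq  1  2  0
```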
Cilium: Replacing kube-proxy entirely
Cilium can replace kube-proxy with eBPF programs that handle service routing in the kernel — before iptables is even consulted. This is the fastest option and what we deploy on performance-sensitive clusters.
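To confirm Cilium's kube-proxy replacement is active, ask the agent itself — a sketch assuming the standard Cilium DaemonSet in `kube-system`:

```shell
# The agent's status report includes a KubeProxyReplacement line:
kubectl -n kube-system exec ds/cilium -- cilium status | grep -i kubeproxy
```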
Level 6: Service Mesh
A service mesh adds a sidecar proxy (typically Envoy) to every pod. All traffic flows through the proxy:
Pod A → Envoy sidecar → network → Envoy sidecar → Pod B
This enables:
- mTLS — Encrypted, authenticated communication between all services
- Traffic splitting — Route 5% to canary, 95% to stable
- Retries and circuit breaking — Resilience without application changes
- Observability — Every request is traced and measured automatically
The cost: latency (2-5ms per hop) and resource overhead (each Envoy sidecar uses ~50MB RAM).
When to use a service mesh: When you have 20+ microservices and need mTLS, traffic management, or deep observability. For simpler setups, Cilium's L7 capabilities often suffice without the sidecar overhead.
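The traffic-splitting bullet above amounts to a few lines of mesh configuration. A hedged sketch using Istio's VirtualService resource (service and subset names here are hypothetical, and the `stable`/`canary` subsets would be defined in a matching DestinationRule):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout            # hypothetical service name
spec:
  hosts:
    - checkout
  http:
    - route:
        - destination:
            host: checkout
            subset: stable
          weight: 95        # 95% of traffic to the stable version
        - destination:
            host: checkout
            subset: canary
          weight: 5         # 5% to the canary
```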
Debugging Toolkit
When networking breaks, these commands save hours:
# Find a container's PID, then enter its network namespace
crictl inspect <container-id> | jq '.info.pid'
nsenter -t <pid> -n ip addr
# Trace packet path
tcpdump -i any -n host 10.244.1.5
# Check iptables rules for a service
iptables -t nat -L KUBE-SERVICES -n | grep <cluster-ip>
# Verify CNI is healthy
kubectl get pods -n kube-system -l k8s-app=calico-node
# Test pod-to-pod connectivity
kubectl exec -it debug-pod -- curl -v http://10.244.2.8:8080
# Check for conntrack table exhaustion
conntrack -C # current count
cat /proc/sys/net/nf_conntrack_max # max
Key Takeaways
- Container networking is Linux networking — namespaces, veth pairs, bridges. The primitives are simple.
- Overlay vs. native routing — Overlays are simpler to set up. Native routing (BGP) has less overhead. Choose based on your environment.
- eBPF is the future — Cilium replaces iptables, kube-proxy, and parts of the service mesh with kernel-level programs. It's faster and simpler.
- Understand the stack — When something breaks, knowing which layer to debug saves hours.
Networking issues in your cluster? We debug container networking daily across dozens of environments. Reach out and we'll get your packets flowing.