What conntrack actually tracks (and what it costs you)
April 23, 2026 · kernel notes · linux · networking · conntrack · netfilter
conntrack is the Linux kernel’s connection tracking subsystem. It’s the thing that lets stateful firewall rules work — rules like “allow established connections” or “match this packet to its return flow” depend on conntrack remembering what’s already happened.
For most people, conntrack is invisible until it breaks. Then it breaks loudly. Connection tables fill up, connections get dropped, NAT goes wrong, and the dmesg log is full of “nf_conntrack: table full, dropping packet.”
Here’s what conntrack actually does, and how to keep it healthy.
What an entry looks like
When the kernel sees a new flow, it creates a conntrack entry. Each entry tracks:
- The 5-tuple in both directions (source IP, source port, dest IP, dest port, protocol)
- Connection state for the protocol (e.g., TCP states like SYN_SENT, ESTABLISHED, TIME_WAIT)
- Per-direction byte and packet counters
- Mark and labels (for tagging, optional)
- An expiration timeout
You can dump the full state with conntrack -L:
tcp 6 431993 ESTABLISHED src=10.0.0.4 dst=93.184.216.34 sport=42010 dport=443 src=93.184.216.34 dst=10.0.0.4 sport=443 dport=42010 [ASSURED] mark=0 use=2
The [ASSURED] flag means the kernel saw traffic in both directions. Until then, the entry is not yet assured and is among the first candidates for eviction when the table fills.
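A quick way to see the ratio of assured entries to the whole table (a sketch; needs the conntrack-tools package and root):

```shell
# Count assured entries vs. the whole table.
total=$(conntrack -L 2>/dev/null | wc -l)
assured=$(conntrack -L 2>/dev/null | grep -c 'ASSURED')
echo "assured: ${assured} of ${total}"
```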
Where conntrack runs
Conntrack registers at netfilter’s PREROUTING and OUTPUT hooks. Every packet that passes through netfilter is looked up in the conntrack table and either matched to an existing entry or given a new one.
Containers complicate this. Each network namespace has its own conntrack table. A packet entering the host, getting NATed to a container, and entering the container’s namespace will have entries in both tables, with different views of the same flow.
This is also why nf_conntrack: table full is such a frequent issue with K8s nodes — kube-proxy in iptables mode creates conntrack pressure across many namespaces.
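Since each namespace has its own table, it can help to compare them directly. A sketch (note: many Kubernetes CNIs don’t register their namespaces with iproute2, so `ip netns list` may come up empty; that’s an assumption to check on your setup):

```shell
# Host table vs. each named network namespace's table.
echo "host: $(cat /proc/sys/net/netfilter/nf_conntrack_count)"
for ns in $(ip netns list | awk '{print $1}'); do
    printf '%s: ' "$ns"
    ip netns exec "$ns" cat /proc/sys/net/netfilter/nf_conntrack_count
done
```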
Sizing and tuning
Two parameters matter most:
# Maximum number of conntrack entries
sysctl net.netfilter.nf_conntrack_max
# Hash table size (default scales with max)
sysctl net.netfilter.nf_conntrack_buckets
The default nf_conntrack_max on a typical server is 65536 or 131072 (the kernel derives it from available memory at boot). For high-traffic nodes, this is too low. Compare cat /proc/sys/net/netfilter/nf_conntrack_count against the max — if you regularly see >70% utilization, raise the max.
Don’t raise it without raising buckets too. The hash table size affects lookup performance. A common ratio is buckets = max / 4.
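Under that max/4 rule of thumb, raising both together looks roughly like this (the specific numbers are arbitrary examples; on older kernels nf_conntrack_buckets is read-only via sysctl and must go through the module parameter):

```shell
# Raise the hash table first: 262144 buckets for a target max of 1048576.
# (Classic interface; newer kernels also accept
#  sysctl -w net.netfilter.nf_conntrack_buckets=262144)
echo 262144 > /sys/module/nf_conntrack/parameters/hashsize

# Then raise the entry limit, keeping the buckets = max / 4 ratio.
sysctl -w net.netfilter.nf_conntrack_max=1048576
```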
Timeouts
Each protocol and state has its own timeout. Defaults are conservative:
sysctl net.netfilter.nf_conntrack_tcp_timeout_established # default: 432000 (5 days)
sysctl net.netfilter.nf_conntrack_tcp_timeout_time_wait # default: 120
sysctl net.netfilter.nf_conntrack_udp_timeout # default: 30
sysctl net.netfilter.nf_conntrack_udp_timeout_stream # default: 180
Five days for ESTABLISHED is wildly excessive for most workloads. If you’re hitting table-full issues, lowering the established timeout to a day or less is one of the most effective interventions:
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400 # 24h
For high-volume HTTP-style services, you can go shorter (3600). Don’t go below ~600 unless you really know your traffic — short timeouts increase NAT collision risk.
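To make a tuned timeout survive reboots, a sysctl.d drop-in works; the filename and the 4-hour value here are arbitrary choices, not defaults:

```shell
# /etc/sysctl.d/90-conntrack.conf
# 4 hours for established TCP flows, down from the 5-day default.
net.netfilter.nf_conntrack_tcp_timeout_established = 14400

# Load it without a reboot:
#   sysctl --system
```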
How to observe pressure
Useful commands:
# Current count
cat /proc/sys/net/netfilter/nf_conntrack_count
# Maximum allowed
cat /proc/sys/net/netfilter/nf_conntrack_max
# Per-CPU stats, including the drop and insert_failed counters
# (the raw /proc/net/stat/nf_conntrack file shows the same counters in hex)
conntrack -S
If “table full, dropping packet” lines start appearing in dmesg, you’re already at the wall. Set up an alert on nf_conntrack_count > 0.85 * nf_conntrack_max.
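A sketch of a cron-able check built on those two files (the script name and fallback handling are assumptions, not a standard tool):

```shell
#!/bin/sh
# conntrack_check.sh -- hypothetical helper; 85 mirrors the alert
# threshold above. pct() computes integer utilization percent.
pct() { echo $(( $1 * 100 / $2 )); }

# Fall back to harmless values if the conntrack module isn't loaded.
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null || echo 0)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null || echo 1)

if [ "$(pct "$count" "$max")" -ge 85 ]; then
    echo "WARN: conntrack table at $(pct "$count" "$max")% (${count}/${max})"
fi
```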
What conntrack does for you (and what it doesn’t)
Conntrack gives you:
- Stateful firewall rules (iptables -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT)
- NAT bidirectional consistency (return packets find their way back)
- Helper modules for protocols that need them (FTP, SIP, etc.)
- Free per-flow byte/packet counters
Conntrack does not give you:
- Layer-7 awareness — it doesn’t know HTTP from raw TCP
- TCP reassembly — entries see segment metadata, not application data
- Long-term flow storage — entries expire, no built-in archive
- Performance — under heavy small-packet traffic, conntrack lookup is significant overhead
For high-rate environments where you don’t actually need state (DDoS scrubbers, load balancers in DSR mode), people sometimes disable conntrack with iptables -t raw -A PREROUTING -j NOTRACK. This is correct for those workloads. It’s wrong for typical server use.
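A sketch of what that looks like scoped to a single service; the port is an arbitrary example, and newer iptables spells the target -j CT --notrack:

```shell
# Bypass conntrack for DNS in both directions. The raw table runs
# before conntrack has a chance to create an entry.
iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT     -p udp --sport 53 -j NOTRACK
```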
When BPF is the better tool
If you’re building per-flow accounting, intrusion detection, or anything that wants visibility into flow lifecycle without the overhead and tuning burden of conntrack, BPF maps work better. You define your own retention, your own eviction, your own labels. You don’t pay for state tracking you don’t need.
But if you’re filtering based on connection state (“only allow ESTABLISHED return traffic”), conntrack is doing real work for you and reimplementing it in BPF is a bigger project than people expect.
Pick based on what state you actually need to know. The defaults are tuned for a server handling a moderate number of moderately long-lived connections. They’re often wrong for real production load.