When Your Server Lags Only When You’re Not Watching: Fixing IRQ Imbalance
TL;DR: My Proxmox server had mysterious network lag that disappeared the moment I opened htop to investigate. Turned out all network interrupts were being handled by a single CPU core. After enabling irqbalance, tuning RPS, and adjusting network buffers, the lag vanished for good. Here’s how to diagnose and fix IRQ imbalance on your own server.
The Observer Effect Bug
You know that feeling when your car makes a weird noise for weeks, but the moment you drive it to the mechanic, it runs perfectly? That was my Proxmox server last month.
The symptoms:
- Random network lag spikes (200-500ms ping jumps)
- Container web services occasionally timing out
- SSH sessions freezing for 2-3 seconds randomly
- The kicker: The moment I SSH’d in and ran htop or opened Netdata, everything ran smoothly
This is the most frustrating kind of bug - one that hides when you try to observe it. Schrödinger’s lag, if you will.
💡 Spoiler: This isn’t magic. When you run monitoring tools, they spin up processes on different CPU cores, which temporarily shifts the workload and gives the overloaded core a moment to catch up. It’s like unclogging a drain by running water through a different pipe.
Prerequisites
Before we dive in, you’ll need:
- A Linux server (I’m using Proxmox, but this applies to any Debian/Ubuntu-based system)
- Root access
- Basic familiarity with the command line
- About 30 minutes for diagnostics and fixes
- Patience (some of this involves watching counters tick up)
Knowledge level: Intermediate - I’ll explain concepts, but you should be comfortable with SSH and editing system files.
Step 1: Security Check First
When weird performance issues show up out of nowhere, my first thought is always: “Did someone break in?”
Before optimizing anything, I checked for:
# Check for suspicious processes
ps aux | grep -iE 'xmr|crypto|mine'
# Look for unauthorized SSH keys
cat ~/.ssh/authorized_keys
find /home -name authorized_keys -exec cat {} \;
# Check for rootkits
apt install -y chkrootkit rkhunter
chkrootkit
rkhunter --check --skip-keypress
# Review running network connections
netstat -tupn | grep ESTABLISHED
# Check cron jobs for weird stuff
crontab -l
cat /etc/cron.d/*
Everything came back clean. No crypto miners, no suspicious processes, no unauthorized access. Good, but now I had to actually debug the real problem.
⚠️ Warning: Never skip this step. Performance issues can be symptoms of compromise. Always rule out security issues before assuming it’s just a configuration problem.
Step 2: Discovering the IRQ Imbalance
After ruling out malware, I started looking at resource distribution. First stop: interrupts.
What are IRQs anyway?
IRQ (Interrupt Request) is how hardware devices (like your network card) get the CPU’s attention. When a network packet arrives, the NIC says “Hey CPU, I got something for you!” via an interrupt.
Normally, these interrupts should be distributed across all CPU cores. But let’s check:
# Watch interrupt distribution in real-time
watch -n 2 'cat /proc/interrupts | grep -E "CPU|eth0|eno1|enp"'
💡 Tip: Replace eth0 or eno1 with your actual network interface name. Find it with ip link show.
Here’s what I saw:
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 ...
eth0-TxRx 3301245 0 0 0 0 0 ...
Yikes. All 3.3 million interrupts on CPU0 only. Every single network packet being handled by one core while the other 11 cores sat idle like they were on break.
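Out of curiosity, you can also check which CPUs a given IRQ is even allowed to land on. A small sketch (the IRQ number here is a placeholder; take the real one from the first column of /proc/interrupts):
# Placeholder IRQ number - read the real one from the first column of /proc/interrupts
IRQ_NUM=24
# Hex bitmask of CPUs allowed to service this interrupt (e.g. 001 = CPU0 only)
cat /proc/irq/"$IRQ_NUM"/smp_affinity
Even when the mask allows several CPUs, the hardware typically delivers each interrupt to just one of them, which is exactly the gap irqbalance fills.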
Checking softirqs
Interrupts have a two-phase handling:
- Hard IRQ: “Hey, packet arrived!” (handled immediately)
- Soft IRQ: “Let me process that packet” (scheduled work)
Let’s check the softirq distribution:
watch -n 2 'cat /proc/softirqs | grep -E "CPU|NET_RX"'
CPU0 CPU1 CPU2 CPU3 ...
NET_RX: 8504321 850432 823012 801234 ...
CPU0 was handling 10x more receive softirqs than other cores. This was the smoking gun.
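A convenience one-liner, easier to read than the wide table, to print the NET_RX count per core:
# Print NET_RX counts one per line, numbered from CPU0
grep NET_RX /proc/softirqs | tr -s ' ' '\n' | grep -E '^[0-9]+$' | nl -v 0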
Why monitoring “fixed” it
When I opened htop, it would:
- Spawn processes on available CPU cores
- Trigger kernel scheduler activity
- Temporarily shift some workload away from CPU0
- Give CPU0 breathing room to process its backlog
The lag wasn’t actually “fixed” - it was just masked temporarily. Like putting a band-aid on a broken pipe.
Step 3: The Fix (Multiple Layers)
IRQ imbalance isn’t fixed with one magic command. It’s a stack of improvements:
Layer 1: Install irqbalance
This daemon automatically distributes interrupts across CPU cores:
# Install and enable
apt install -y irqbalance
systemctl enable --now irqbalance
# Verify it's running
systemctl status irqbalance
Within 10 seconds, I saw interrupts spreading across cores:
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 ...
eth0-TxRx 3301245 1203 2104 1832 2411 1937 ...
Much better! But we’re not done.
Layer 2: Enable RPS (Receive Packet Steering)
RPS is like power steering for network packets - it helps distribute packet processing across CPU cores at the software level.
# Find your interface name
ip link show
# Enable RPS for all CPUs (adjust the hex value based on your CPU count)
# fff = 12 CPUs in hex (111111111111 in binary)
# For 4 CPUs use 'f', for 8 use 'ff', for 16 use 'ffff'
echo "fff" > /sys/class/net/eth0/queues/rx-0/rps_cpus
# Verify
cat /sys/class/net/eth0/queues/rx-0/rps_cpus
How to calculate your RPS mask:
- Count your CPU cores: nproc
- Convert to hex: 4 CPUs = f, 8 = ff, 12 = fff, 16 = ffff
- The mask is a bitmask where each bit represents one CPU core (a small helper sketch follows this list)
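If you’d rather not work out the hex by hand, here’s a sketch that derives the mask from nproc (plain shell arithmetic, so it assumes a core count below 63):
# Derive an RPS mask that covers every core reported by nproc
CORES=$(nproc)
MASK=$(printf '%x' $(( (1 << CORES) - 1 )))
echo "RPS mask for ${CORES} cores: ${MASK}"   # 12 cores -> fff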
Layer 3: Increase NIC Ring Buffers
The NIC’s ring buffer is like a waiting room for packets. Mine was tiny:
# Check current size
ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
TX: 4096
Current hardware settings:
RX: 256 # Way too small!
TX: 256
# Increase to maximum
ethtool -G eth0 rx 4096 tx 4096
# Verify
ethtool -g eth0
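Not every NIC supports 4096, so rather than hardcoding it you can read the driver’s reported maximums and apply those. A sketch, with IFACE as a placeholder:
# Read the pre-set maximums and apply them instead of assuming 4096
IFACE=eth0   # placeholder - use your interface name
RX_MAX=$(ethtool -g "$IFACE" | awk '/Pre-set maximums/{f=1} f && $1=="RX:"{print $2; exit}')
TX_MAX=$(ethtool -g "$IFACE" | awk '/Pre-set maximums/{f=1} f && $1=="TX:"{print $2; exit}')
ethtool -G "$IFACE" rx "$RX_MAX" tx "$TX_MAX"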
Layer 4: Tune Network Stack (sysctl)
Create /etc/sysctl.d/99-network-tuning.conf:
# Increase socket buffer sizes (128MB)
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
# TCP buffer auto-tuning
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
# Increase network device backlog queue
net.core.netdev_max_backlog = 5000
# Increase softirq processing budget
net.core.netdev_budget = 600
# Enable TCP window scaling
net.ipv4.tcp_window_scaling = 1
Apply immediately:
sysctl -p /etc/sysctl.d/99-network-tuning.conf
What these do:
- rmem_max / wmem_max: Maximum socket buffer sizes (raised from the ~200KB default to 128MB)
- tcp_rmem / tcp_wmem: TCP buffer auto-tuning ranges
- netdev_max_backlog: How many packets can queue before the kernel starts dropping them
- netdev_budget: How many packets to process per softirq cycle
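To spot-check that the new values actually took effect:
# Each of these should print the value you set in 99-network-tuning.conf
sysctl net.core.rmem_max net.core.netdev_max_backlog net.core.netdev_budget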
Step 4: Make It Persistent
The ethtool and RPS changes reset on reboot. Let’s fix that.
Create /etc/systemd/system/network-tuning.service:
[Unit]
Description=Network Performance Tuning
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
RemainAfterExit=yes
# Replace eth0 with your interface name and fff with your CPU mask
ExecStart=/bin/bash -c 'echo "fff" > /sys/class/net/eth0/queues/rx-0/rps_cpus'
ExecStart=/usr/sbin/ethtool -G eth0 rx 4096 tx 4096
[Install]
WantedBy=multi-user.target
⚠️ Important: Replace eth0 with your interface name and fff with your CPU mask!
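One caveat: the ExecStart line above only touches rx-0. If your NIC exposes multiple receive queues, a small loop (same placeholders) covers them all:
# Apply the RPS mask to every RX queue the interface exposes, not just rx-0
IFACE=eth0   # placeholder
MASK=fff     # placeholder - your CPU mask
for q in /sys/class/net/"$IFACE"/queues/rx-*/rps_cpus; do
    echo "$MASK" > "$q"
done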
Enable it:
systemctl daemon-reload
systemctl enable network-tuning.service
systemctl start network-tuning.service
Step 5: Verify the Fix
Now let’s confirm everything is working:
Monitor interrupt distribution:
watch -n 2 'cat /proc/interrupts | grep -E "CPU|eth0"'
You should see interrupts spreading across all cores.
Check for packet drops:
netstat -s | grep -iE 'drop|prune|collapse'
Before the fix, I was seeing thousands of dropped packets. After: zero.
Watch softirq distribution:
watch -n 2 'cat /proc/softirqs | grep -E "CPU|NET_RX"'
NET_RX should be relatively balanced across cores (not perfectly even, but no single core with 10x load).
Real-world test:
# From another machine
ping your-server-ip -c 100
Before: 5-10% of pings had 200-500ms spikes
After: Consistent sub-5ms responses
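If you have the sysstat package installed, mpstat gives one more sanity check: the %soft column should no longer pile up on a single core.
# Per-core CPU breakdown, 5 samples at 2-second intervals (watch the %soft column)
mpstat -P ALL 2 5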
Mistakes I Made
- Didn’t check security first (initially): I jumped straight into performance tuning before ruling out crypto miners. Always check security first.
- Used wrong RPS mask: I initially used f (4 CPUs) when I had 12 cores. Check with nproc!
- Forgot to persist settings: Spent 2 hours tuning, rebooted for an update, everything reset. Always create the systemd service.
- Didn’t monitor before/after: I wish I had captured baseline metrics before starting. Always document your “before” state.
What I Learned
Why irqbalance wasn’t running
Debian/Ubuntu don’t always install irqbalance by default. It should be standard on any server handling real traffic.
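A quick way to check where your own box stands:
# Is irqbalance installed, and is it actually running?
dpkg -s irqbalance 2>/dev/null | grep ^Status
systemctl is-active irqbalance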
The “observer effect” isn’t paranormal
When tools run on different cores, they trigger kernel activity that shifts workload. It’s not magic - it’s just the scheduler doing its job differently under load.
Network tuning is a stack
There’s no single “fix network lag” command. It’s:
- Hardware interrupt distribution (irqbalance)
- Software packet steering (RPS)
- Buffer sizing (ring buffers)
- Kernel parameters (sysctl)
- All working together
Modern NICs need modern settings
Default settings from 2010 don’t work well with 2025 network speeds. A 256-packet ring buffer is absurdly small for a gigabit NIC.
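A rough back-of-envelope makes the point (assuming full-size 1500-byte frames at 1 Gbps; bc is only doing the arithmetic):
# 256 frames x 1500 bytes x 8 bits, divided by 1 Gbps, expressed in milliseconds
echo "scale=2; 256 * 1500 * 8 * 1000 / 10^9" | bc   # ~3 ms of buffering before packets drop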
When Does This Matter?
💭 Reality check: If your homelab serves 3 users and a Plex stream, you probably don’t need this. But if you’re running:
- Multiple containers with web services
- Game servers
- VPN gateways handling real traffic
- Network-intensive applications
Then yes, IRQ balance matters. My server went from “randomly frustrating” to “rock solid” after this fix.
Rollback (If Needed)
If something breaks:
# Stop and disable our tuning service
systemctl stop network-tuning.service
systemctl disable network-tuning.service
# Reset ring buffers to defaults
ethtool -G eth0 rx 256 tx 256
# Remove sysctl tuning
rm /etc/sysctl.d/99-network-tuning.conf
sysctl -p
# Stop irqbalance
systemctl stop irqbalance
systemctl disable irqbalance
# Reboot to fully reset
reboot
Got questions? Found this helpful? Let me know - I’d love to hear your debugging stories too.
Update 2025-12-11: Initial publication