Linux Performance Analysis | Advanced Linux Administration
Slide 1 of 35 | ALA-08 | Week 4 of 8
Linux Performance Analysis
top • htop • iostat • vmstat • sar • free • /proc & /sys • Load Average
A sector node is running slow. Response times are up, users are complaining, and the on-call page just fired. You have sixty seconds to determine whether the bottleneck is CPU, memory, disk I/O, or network before the incident commander asks you for an initial assessment. This lecture gives you the tools and the mental model.
35 Slides | ALA-08 | Week 4 of 8 | Ubuntu 22.04 LTS
Slide 2 of 35
The Performance Investigation Model
Start broad, narrow fast. Do not tune before you have measured.
Step 1: Observe the Whole System
top or htop for a 30-second snapshot. What is CPU doing? Is memory full? What is the load average? What processes are consuming the most? This takes 30 seconds and often identifies the bottleneck category immediately.
Step 2: Isolate the Subsystem
Based on Step 1: high CPU leads to perf and mpstat. High I/O wait leads to iostat and iotop. Memory exhaustion leads to free, vmstat, and /proc/meminfo. Network saturation leads to sar -n DEV and ss.
Step 3: Quantify and Act
Every measurement needs a baseline to be meaningful. A load average of 8 on a 4-core machine is different from a load of 8 on a 64-core machine. Know your system's normal before diagnosing abnormal. sar provides historical baselines.
Brendan Gregg's USE Method
For every resource (CPU, memory, disk, network): check Utilization, Saturation, and Errors. A resource at 90% utilization with a saturated queue and error counters ticking is a clear bottleneck. Resources with low utilization and no saturation are not your problem.
Slide 3 of 35
Load Average: The Most Misread Metric
Load average measures the demand on the CPU scheduler. Understanding it requires knowing your CPU count.
What It Measures
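Linux load average = running processes + processes waiting in the run queue + processes in uninterruptible sleep (D state, usually I/O wait). A load of 1.0 on a single-core system means the CPU is exactly saturated, provided the load is all CPU demand. The same load on an 8-core system means it is at most 12.5% utilized. Because D-state tasks count toward load without using CPU, a high load with idle CPUs points to I/O rather than CPU.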
Three Numbers
Load average is reported over three periods: 1 minute, 5 minutes, and 15 minutes. 1.23 0.87 0.54 means load increased recently (1-min is highest). 0.54 0.87 1.23 means load is decreasing. Compare the three to understand the trend direction.
When Is Load Too High?
Rule of thumb: sustained load average above the number of CPU cores indicates saturation. nproc gives you the core count. A load of 4.0 on a 4-core machine is worth investigating. A load of 4.0 on a 32-core machine is very light. Always normalize.
# Read load average
cat /proc/loadavg
# 2.34 1.87 1.42 3/287 14823
# 2.34 = 1-min avg   1.87 = 5-min avg   1.42 = 15-min avg
# 3/287 = 3 running / 287 total threads   14823 = last created PID
nproc     # number of logical CPUs available
uptime    # uptime + load averages in one line
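The normalization rule and the three-number trend reading can be sketched as two small shell helpers. The function names load_per_core and load_trend are hypothetical, and the trend check is a simplification that only compares the 1-minute and 15-minute values.

```shell
#!/usr/bin/env bash
# Sketch: normalize load average by core count and classify the trend.
# load_per_core and load_trend are illustrative names, not standard tools.

# Print load divided by core count, to two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Compare the 1-minute and 15-minute averages to classify the trend.
load_trend() {
    awk -v one="$1" -v fifteen="$3" 'BEGIN {
        if (one > fifteen)      print "rising"
        else if (one < fifteen) print "falling"
        else                    print "steady"
    }'
}

# Example with the sample values from this slide, assuming a 4-core machine.
load_per_core 2.34 4
load_trend 2.34 1.87 1.42
```

On a live system the same helpers could be fed from the kernel directly, for example: read -r one five fifteen _ < /proc/loadavg; load_per_core "$one" "$(nproc)". A result persistently above 1.00 matches the "load above core count" rule of thumb.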
Slide 4 of 35
top: Real-Time Process Monitor
The universal first-response tool. Learn the header fields and interactive controls cold.
# top header breakdown:
top - 14:32:01 up 12 days, 3:14 2 users load average: 2.34, 1.87, 1.42
Tasks: 287 total 3 running 284 sleeping 0 stopped 0 zombie
%Cpu(s): 23.4 us 5.1 sy 0.0 ni 68.2 id 3.3 wa 0.0 hi 0.0 si 0.0 st
MiB Mem: 32768.0 total 1024.0 free 18432.0 used 13312.0 buff/cache
MiB Swap: 8192.0 total 7680.0 free 512.0 used 14336.0 avail Mem
# CPU line fields:
# us = user space sy = kernel ni = nice id = idle
# wa = I/O wait hi = hardware interrupts si = software interrupts
# st = steal (VM hypervisor taking CPU time from this VM)
# High wa% = I/O bottleneck
# High sy% = excessive system calls or kernel work
# High st% = noisy neighbor on hypervisor -- contact cloud provider
# Interactive keys (while top is running):
# 1 -- toggle per-CPU breakdown
# M -- sort by memory usage (RES)
# P -- sort by CPU usage (default)
# k -- kill a process by PID
# u -- filter by username
# f -- manage column fields
# W -- write current settings to ~/.toprc
Slide 5 of 35
top Process Fields: Reading the Table
Every column in top has a specific meaning. These are the ones you must know in an incident.
VIRT vs RES vs SHR
VIRT: total virtual memory requested (includes mapped but unused). RES: resident set size, the physical RAM actually used. SHR: shared memory (shared libraries, etc.). For memory pressure, focus on RES. VIRT is usually misleadingly large.
%CPU
Percentage of a single CPU core. On an 8-core system, a process can show 800% if it uses all 8 cores. Values above 100% are normal for multithreaded processes. Divide by core count to get normalized utilization.
S (Process State)
R: running or runnable (in run queue). S: sleeping (waiting for event, normal). D: uninterruptible sleep (I/O wait -- cannot be killed). Z: zombie (exited but not reaped). T: stopped (SIGSTOP or trace). Many D-state processes = I/O saturation.
# Useful top command-line options
top -bn1        # batch mode, 1 iteration: non-interactive output (for scripts)
top -p 14823    # monitor specific PID only
top -u nginx    # show only processes for user 'nginx'
top -d 0.5      # update every 0.5 seconds (faster refresh)
# Script-friendly: top 5 CPU consumers
top -bn1 | awk 'NR>7{print}' | head -5
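The "many D-state processes" signal can be counted without reading the top table by eye. The sketch below assumes the state strings come from ps -eo stat= (one state per line); count_dstate is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count processes in uninterruptible sleep (state begins with "D").
# count_dstate is an illustrative name; it reads one state string per line.
count_dstate() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Illustrative input: five state strings, two of them D-state.
printf '%s\n' Ss D R+ D+ S | count_dstate
```

On a live system: ps -eo stat= | count_dstate. A count that stays well above zero across several runs is a reason to move on to iostat and iotop.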
Slide 6 of 35
htop: The Ergonomic Alternative
htop provides color-coded meters, tree views, and mouse support while showing the same data as top.
Key Advantages Over top
Per-CPU bars shown graphically. Memory bar distinguishes used/buffers/cache. Process tree view (F5 or t) shows parent-child relationships. F6 sorts by any column. F4 filters by string. Click to select a process, then kill it with F9 or k.
htop Color Coding
CPU bar: green = user, blue = low-priority, red = kernel, yellow = IRQ. Memory bar: green = used, blue = buffers, yellow = cache. High blue and yellow on the memory bar is healthy -- it means the OS is using available RAM for buffers and disk caching, which is normal and desirable.
# Install if not present
apt install htop
# Launch htop
htop
# Key bindings to know:
# F2 -- setup: configure columns, meters, color schemes
# F3 or / -- search for process by name
# F4 -- filter: show only matching processes
# F5 or t -- toggle tree view (parent-child relationships)
# F6 -- sort by selected column
# u -- show processes for a specific user
# k -- send signal to selected process
# l -- show open files for selected process (lsof)
# s -- strace selected process
# i -- set I/O priority (ionice) for selected process
# H -- toggle display of user threads (K toggles kernel threads)
Slide 7 of 35
vmstat: Virtual Memory and CPU Scheduler Statistics
vmstat provides a dense, time-series view of memory, swap, I/O, and CPU in a single compact table.
# vmstat 1 5: 1-second intervals, 5 samples
vmstat 1 5
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
# r b swpd free buff cache si so bi bo in cs us sy id wa st
# 2 0 0 98304 2048 524288 0 0 12 48 420 890 8 2 89 1 0
# 1 0 0 97280 2048 524288 0 0 0 144 380 750 6 1 92 1 0
# Column breakdown:
# r = processes in run queue (number waiting for CPU)
# b = processes in uninterruptible sleep (I/O wait)
# swpd = virtual memory in swap (should be near 0)
# si/so = swap in/out per second (non-zero = memory pressure)
# bi/bo = blocks in/out per second (disk reads/writes)
# in = hardware interrupts per second
# cs = context switches per second (high cs = scheduler overhead)
# us/sy/id/wa = same as top CPU percentages
# Warning signs:
# r consistently > CPU count = CPU saturation
# b consistently > 0 = I/O bottleneck
# si/so non-zero = swapping (memory exhaustion)
# wa consistently > 20% = disk I/O bottleneck
# Disk statistics mode
vmstat -d 1 3    # per-disk read/write stats
vmstat -s        # event summary (total context switches, interrupts, etc.)
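The warning signs above can be checked mechanically. This is a minimal sketch, assuming the 17-column layout shown on this slide (r=1, b=2, si=7, so=8, wa=16); vmstat_flags is a hypothetical helper, and "consistently" is interpreted here as "in every sample". Note that the first vmstat row reports averages since boot, so longer sampling runs give a fairer picture.

```shell
#!/usr/bin/env bash
# Sketch: flag the vmstat warning signs from this slide.
# vmstat_flags is an illustrative name; column positions assume the
# header layout shown above (r b swpd free buff cache si so bi bo in cs us sy id wa st).
vmstat_flags() {
    awk -v cores="$1" '
        $1 !~ /^[0-9]+$/ { next }              # skip the two header lines
        {
            n++
            if ($1 > cores)        cpu++       # run queue longer than core count
            if ($2 > 0)            blocked++   # tasks in uninterruptible sleep
            if ($7 > 0 || $8 > 0)  swap++      # swap-in or swap-out activity
            if ($16 > 20)          iowait++    # I/O wait above 20%
        }
        END {
            if (n == 0) exit
            if (cpu == n)     print "CPU saturation: r > " cores " in every sample"
            if (blocked == n) print "I/O bottleneck: b > 0 in every sample"
            if (swap > 0)     print "Swapping: si/so non-zero in " swap " of " n " samples"
            if (iowait == n)  print "Disk wait: wa > 20% in every sample"
        }'
}

# Illustrative rows (not measured data): a 4-core box under CPU and disk pressure.
vmstat_flags 4 <<'EOF'
 r  b swpd  free buff  cache si so bi  bo  in  cs us sy id wa st
 9  1    0 98304 2048 524288  0  0 12  48 420 890  8  2 60 30  0
 6  2    0 97280 2048 524288  0  4  0 144 380 750  6  1 63 30  0
EOF
```

Live usage would be: vmstat 1 5 | vmstat_flags "$(nproc)".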
Slide 8 of 35
free: Memory Usage Analysis
The most important column in free is "available" -- not "free". Linux uses all available RAM for caching, which is correct behavior.
free -h
# total used free shared buff/cache available
# Mem: 31Gi 18Gi 1.0Gi 512Mi 12Gi 13Gi
# Swap: 8.0Gi 500Mi 7.5Gi
# Field meanings:
# total — installed physical RAM
# used — RAM used by applications (not including cache)
# free — completely unused RAM (almost always small -- this is fine)
# shared — tmpfs, shared memory segments
# buff/cache — OS disk cache + buffer cache (this is good: can be reclaimed)
# available — how much RAM a new process can actually use (free + reclaimable cache)
# DO NOT PANIC about "free" being small.
# The OS fills free RAM with disk cache to speed up reads.
# "available" is the real answer to "how much memory can I use?"
# Swap analysis:
# If Swap: used > 0, some pages have been swapped out at some point (not necessarily a current problem).
# If Swap: used is growing over time, you have a memory leak or undersized RAM.
# Monitor memory every 2 seconds
watch -n2 'free -h'
# Detailed memory breakdown
cat /proc/meminfo | head -20
Slide 9 of 35
/proc/meminfo: Kernel Memory Accounting
The raw source of all memory statistics. Understanding the key fields lets you diagnose OOM conditions before they happen.
# Key fields from /proc/meminfo
cat /proc/meminfo
# MemTotal: 33554432 kB — installed RAM
# MemFree: 1048576 kB — completely unused
# MemAvailable: 13631488 kB — available for new allocations
# Buffers: 204800 kB — kernel I/O buffer cache (block devices)
# Cached: 12582912 kB — page cache (files read from disk)
# SwapCached: 0 kB — swap content also in RAM (double counted)
# Active: 18350080 kB — recently used, not easily reclaimed
# Inactive: 9175040 kB — not recently used, can be reclaimed
# Slab: 2097152 kB — kernel slab allocator (dentry/inode cache)
# SReclaimable: 1572864 kB — slab that can be freed under pressure
# SUnreclaim: 524288 kB — slab that cannot be freed
# VmallocTotal: very large — virtual address space for kernel
# HugePages_Total: 0 — 2MB huge pages configured
# DirectMap2M: 32505856 kB -- direct mapped memory using 2MB pages
# Script: check if OOM is imminent
AVAIL=$(awk '/MemAvailable/{print $2}' /proc/meminfo)
TOTAL=$(awk '/MemTotal/{print $2}' /proc/meminfo)
PCT=$(( 100 - (AVAIL * 100 / TOTAL) ))
(( PCT > 90 )) && echo "CRITICAL: Memory ${PCT}% used -- OOM risk"
Slide 10 of 35
iostat: Disk I/O Statistics
iostat measures disk throughput, I/O operations per second, and wait times. The first tool to reach for when top shows high wa%.
# iostat with extended statistics, 2-second intervals, 5 samples
# (the first report shows averages since boot; the column set and order vary by sysstat version)
iostat -xz 2 5
# Device: r/s w/s rkB/s wkB/s rrqm/s wrqm/s %util await r_await w_await svctm aqu-sz
# sda: 10.2 45.3 4096.0 18432.0 0.5 12.3 68.4 8.23 2.10 9.50 1.82 0.37
# Key fields:
# r/s — read operations per second
# w/s — write operations per second
# rkB/s — read throughput (KB/s)
# wkB/s — write throughput (KB/s)
# await — average wait time per I/O request (ms) -- KEY METRIC
# r_await / w_await — separate read and write latencies
# %util — how busy the device is (100% = saturated)
# aqu-sz — average queue depth (>1 = requests piling up)
# Warning signs:
# %util > 80% consistently = disk saturation
# await > 10ms for HDD, > 1ms for NVMe SSD = slow I/O
# aqu-sz > 1 = queue building up (worse than util alone)
# CPU I/O wait from iostat
iostat -c 1        # CPU stats only: us sy ni id wa steal
iostat -d sda 1    # single device: sda
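Because iostat column positions differ between sysstat versions, a threshold check is safer when it looks up %util by header name instead of by position. The following is a sketch under that assumption; iostat_hot is a hypothetical helper, and the 80% default mirrors the warning sign above.

```shell
#!/usr/bin/env bash
# Sketch: print devices whose %util exceeds a limit (default 80).
# iostat_hot is an illustrative name; it finds the %util column from the
# "Device" header line rather than assuming a fixed position.
iostat_hot() {
    awk -v limit="${1:-80}" '
        $1 ~ /^Device/ {
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 > limit { print $1, $col }
    '
}

# Illustrative input (not measured data) with a shortened column set.
iostat_hot 80 <<'EOF'
Device r/s w/s %util
sda 10.2 45.3 91.5
sdb 1.0 2.0 12.0
EOF
```

Live usage would be: LC_ALL=C iostat -xz 2 5 | iostat_hot 80. Setting LC_ALL=C keeps the decimal separator a period so the numeric comparison behaves, and the first report (averages since boot) is included in the output, so read later reports for the current state.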
Slide 11 of 35
iotop: Per-Process I/O Monitor
iostat shows disk totals. iotop shows which specific process is responsible for the I/O load.
# Install iotop
apt install iotop
# Run iotop (requires root or CAP_NET_ADMIN)
iotop           # interactive, sorted by I/O rate
iotop -o        # --only: show only processes with active I/O (cleaner)
iotop -b -n5    # batch mode, 5 iterations (scriptable)
# Output format:
# TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
# 8342 be/4 mysql 0.00 B/s 18.42 M/s 0.00% 12.3% mysqld
# Script: log processes reading or writing more than 1000 kB/s
# -k forces kB/s so the threshold compares like units; -qqq drops headers and totals
iotop -b -o -k -n2 -qqq 2>&1 | \
awk '$4+0 > 1000 || $6+0 > 1000' | \
head -5 | \
logger -t iotop-alert
# Alternative: /proc/PID/io (per-process I/O without iotop)
cat /proc/14823/io
# rchar: 1048576 (bytes read via read() calls)
# wchar: 524288 (bytes written via write() calls)
# syscr: 256 (number of read() syscalls)
# syscw: 128 (number of write() syscalls)
Slide 12 of 35
sar: System Activity Reporter
sar collects and reports historical system performance data. The only tool that lets you investigate a performance issue that happened last night.
Historical Analysis
sar records CPU, memory, I/O, and network statistics every 10 minutes by default (via a cron job or systemd timer that calls sa1). This data is stored in /var/log/sysstat/ for up to 28 days, depending on the HISTORY setting in /etc/sysstat/sysstat. When an incident happens at 03:00, sar has the data you need at 08:00.
Real-Time Mode
sar also works like vmstat: sar -u 1 10 shows CPU utilization every second for 10 samples. This makes sar a single tool that covers both real-time investigation and historical review.
Installation
Part of the sysstat package. On Ubuntu, after installing, enable collection in /etc/default/sysstat by setting ENABLED="true". The collection cron job then starts recording automatically.
# Install and enable sysstat
apt install sysstat
sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
systemctl enable --now sysstat
# Verify data collection is running
ls -la /var/log/sysstat/
Slide 13 of 35
sar: CPU and Load History
Pull yesterday's CPU utilization graph to determine when the performance problem began.
# CPU utilization from today's data file
sar -u                             # today's CPU history (10-min intervals)
sar -u 1 5                         # live: 1-second intervals, 5 samples
sar -u -f /var/log/sysstat/sa08    # data file for the 8th of the month (e.g. April 8)
# Per-CPU breakdown
sar -P ALL 1 5                     # all CPUs individually
sar -P 0,1,2,3 1 3                 # specific CPUs 0-3
# Context switches and interrupts
sar -w 1 5                         # context switches per second
sar -I ALL 1 3                     # interrupt rates by interrupt number
# Load average history
sar -q                             # run queue and load average
# Find the peak load window in yesterday's data
LC_ALL=C sar -q -f /var/log/sysstat/sa08 | sort -k4 -rn | head -5
# LC_ALL=C gives 24-hour timestamps, so column 2 is runq-sz and column 4 is ldavg-1.
# Sorted by ldavg-1, highest first (use -k2 to sort by runq-sz instead)
Slide 14 of 35
sar: Memory and Disk History
Correlate the memory utilization timeline with the CPU timeline to pinpoint when the system came under stress.
/proc is a virtual filesystem. Every file in it is a live view into kernel state. Most performance tools read from it.
System-Wide /proc Files
/proc/loadavg: load averages. /proc/meminfo: detailed memory. /proc/stat: CPU stats since boot. /proc/diskstats: disk I/O counters. /proc/net/dev: network interface stats. /proc/sys/: tunable kernel parameters.
Per-Process /proc/PID/
status: name, state, memory. cmdline: full command. fd/: open file descriptors. maps: memory map. io: I/O counters. net/: network sockets. cgroup: cgroup membership. oom_score: OOM killer priority.
/proc/sys Tunables
Read and write kernel parameters in real time. Changes take effect immediately but reset on reboot. Make permanent via /etc/sysctl.conf or files in /etc/sysctl.d/. Apply without reboot: sysctl -p (reads /etc/sysctl.conf only) or sysctl --system (reads all configuration files, including /etc/sysctl.d/).
# Examples of reading /proc directly
cat /proc/stat | head -5                 # raw CPU tick counters per CPU
cat /proc/diskstats                      # raw disk I/O counters (basis for iostat)
cat /proc/net/dev                        # raw network packet/byte counters
cat /proc/14823/status                   # specific process memory and state
cat /proc/14823/cmdline | tr '\0' ' '    # full command line (null-separated)
ls -la /proc/14823/fd/                   # all open file descriptors
cat /proc/14823/oom_score                # OOM killer score (higher = killed first)
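To illustrate how tools derive percentages from these raw counters, here is a sketch that computes CPU utilization from two samples of the aggregate "cpu" line in /proc/stat. cpu_busy_pct is a hypothetical helper; it treats idle and iowait ticks as not busy and sums fields 2 through 9 (user through steal) as the total, which is an assumption about how you want to count iowait.

```shell
#!/usr/bin/env bash
# Sketch: CPU busy percentage between two /proc/stat "cpu" lines.
# cpu_busy_pct is an illustrative name. Field layout of the line:
#   cpu user nice system idle iowait irq softirq steal guest guest_nice
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i   # user..steal
            idle = $5 + $6                                    # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle }
            else         { dt = total - t0; di = idle - i0 }
        }
        END { if (dt > 0) printf "%.1f\n", 100 * (dt - di) / dt }'
}

# Illustrative counters (not measured data): 1000 ticks elapsed, 850 of them idle.
cpu_busy_pct "cpu 100 0 50 800 50 0 0 0 0 0" "cpu 200 0 100 1600 100 0 0 0 0 0"
```

On a live system the two samples could be taken one second apart: a="$(grep '^cpu ' /proc/stat)"; sleep 1; b="$(grep '^cpu ' /proc/stat)"; cpu_busy_pct "$a" "$b".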
Slide 16 of 35
/sys: Hardware and Driver Interface
/sys exposes hardware topology, device configuration, and kernel driver settings as a filesystem.
# CPU frequency and power
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# governors: performance, powersave, ondemand, conservative, schedutil
# Set CPU governor to performance for maximum throughput
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo "performance" > "$cpu"
done
# Block device queue tuning
cat /sys/block/sda/queue/scheduler      # current I/O scheduler
cat /sys/block/sda/queue/nr_requests    # queue depth
cat /sys/block/sda/queue/rotational     # 0 = SSD, 1 = HDD
# Set optimal scheduler (mq-deadline for SSD, bfq for HDD)
echo "mq-deadline" > /sys/block/sda/queue/scheduler
# Network device statistics and tuning
cat /sys/class/net/eth0/statistics/rx_bytes
cat /sys/class/net/eth0/statistics/rx_dropped
cat /sys/class/net/eth0/speed    # link speed in Mbps
# NUMA topology
cat /sys/devices/system/node/possible
cat /sys/devices/system/node/node0/meminfo
Slide 17 of 35
sysctl: Kernel Parameter Tuning
sysctl provides a clean interface to read and set kernel parameters that live in /proc/sys/.
# Read a parameter
sysctl vm.swappiness        # vm.swappiness = 60
sysctl net.core.somaxconn
# Set a parameter (takes effect immediately)
sysctl -w vm.swappiness=10
sysctl -w net.core.somaxconn=65536
# List all vm.* kernel parameters
sysctl -a | grep '^vm\.'
# Persist in /etc/sysctl.d/99-sector.conf
# vm.swappiness = 10
# net.core.somaxconn = 65536
# net.ipv4.tcp_max_syn_backlog = 8192
# vm.dirty_ratio = 15
# vm.dirty_background_ratio = 5
# Apply without reboot
sysctl -p /etc/sysctl.d/99-sector.conf
# Key parameters explained:
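# vm.swappiness=10 -- prefer reclaiming page cache over swapping (lower = swap less; 0 avoids swap until memory is nearly exhausted but does not disable it; 100 = swap aggressively)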
# vm.dirty_ratio=15 — % RAM that can hold dirty pages before writes are forced
# net.core.somaxconn — max listen() backlog (connections pending accept())
# net.ipv4.tcp_fin_timeout -- how long orphaned sockets stay in FIN-WAIT-2 (reduce for busy servers)
Slide 18 of 35
perf: CPU Profiling and Event Counting
perf uses hardware performance counters to profile CPU usage at the function level, revealing where exactly time is being spent.
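# Install (on Ubuntu, perf ships in the linux-tools packages; match the running kernel)
apt install linux-tools-common linux-tools-generic "linux-tools-$(uname -r)"
# perf stat: count hardware events for a command
perf stat ls /var/log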
# Performance counter stats for 'ls /var/log':
# 124,832 cache-misses
# 3,284,110 cache-references (3.80% of all refs)
# 822 page-faults
# 12,834,782 instructions (1.12 insn per cycle)
# 1,432,440 branch-misses (4.23% of all branches)
# perf top: live sampling profiler (like top, but shows function names)
perf top             # system-wide function-level profiling
perf top -p 14823    # profile specific process
# Record a profile and analyze it
perf record -g -p 14823 -- sleep 10    # record with call graphs for 10 seconds
perf report            # interactive analysis
perf report --stdio    # text output (for logging)
# Flame graph: perf + FlameGraph for visual profiling
perf record -g -p 14823 -- sleep 30
perf script | /opt/FlameGraph/stackcollapse-perf.pl | \
/opt/FlameGraph/flamegraph.pl > /tmp/flame.svg
Slide 19 of 35
mpstat: Per-CPU Statistics
mpstat shows per-CPU utilization. Single-core saturation while others sit idle indicates a single-threaded bottleneck.
# All CPUs, 1-second intervals, 5 samples
mpstat -P ALL 1 5
# Sample output:
# CPU %usr %sys %iowait %irq %soft %idle
# all 18.4 3.2 8.1 0.2 0.4 69.7
# 0 72.0 4.1 0.0 0.0 0.0 23.9 <-- saturated
# 1 2.1 3.0 32.1 0.0 0.0 62.8 <-- I/O wait
# 2 3.2 2.8 0.0 0.0 0.0 94.0
# 3 1.2 2.8 0.0 0.0 0.0 96.0
# Interpretation:
# CPU 0 at 72% user while the others idle = likely single-threaded bottleneck (spread work across more threads)
# CPU 1 at 32% iowait = I/O bound process on that CPU
# Check which CPUs a given PID is allowed to run on: taskset -p PID
# Single CPU (CPU 0 only)
mpstat -P 0 1 5
# Check CPU affinity of a process
taskset -pc 14823        # current affinity list
# Set CPU affinity to CPUs 2 and 3 only
taskset -pc 2,3 14823
Slide 20 of 35
df and du: Filesystem and Directory Space
df shows filesystem-level usage. du shows directory-level usage. Together they find what is consuming disk space.
# df: filesystem usage
df -h             # human-readable: GiB, MiB
df -hT            # include filesystem type (ext4, xfs, tmpfs)
df -i             # inode usage (can be full even if bytes are free)
df -h /var/log    # only the filesystem containing /var/log
# "No space left on device" but df -h shows free space: check inode exhaustion
df -i /var/log
# Filesystem Inodes IUsed IFree IUse% Mounted on
# /dev/sda1 524288 524288 0 100% /var/log <-- inodes exhausted
# du: directory usage
du -sh /var/log/*               # size of each item directly under /var/log
du -sh /* 2>/dev/null           # top-level directory sizes
du -sh /var/log/* | sort -rh    # sorted by size, largest first
# Find top 20 largest files under /var (paths with spaces are kept intact)
find /var -type f -printf '%s %p\n' 2>/dev/null | \
sort -rn | \
head -20 | \
awk '{size=$1; sub(/^[0-9]+ /, ""); printf "%.1f MB\t%s\n", size/1048576, $0}'
Slide 21 of 35
Network Performance: ss, nethogs, and sar -n
Identify network saturation, connection storms, and which processes are consuming bandwidth.
# ss: socket statistics (replacement for netstat)
ss -s                               # summary: total sockets by state
ss -tunapl                          # TCP/UDP, numeric, all, processes, listening
ss -tp state established            # established TCP connections with process
ss -H -t state time-wait | wc -l    # count TIME_WAIT (high = connection churn); -H omits the header line
# nethogs: per-process bandwidth (like iotop for network)
apt install nethogs
nethogs eth0
# sar network history
sar -n DEV 1 5    # per-interface rx/tx rates live
sar -n TCP 1 5    # TCP segments and connection rates
sar -n SOCK       # socket counts: TCP, UDP, raw
# Check interface for errors and drops
ip -s link show eth0
ethtool -S eth0 | grep -i 'drop\|error\|miss'
# Real-time bandwidth with ifstat (simple, useful)
apt install ifstat
ifstat -i eth0 1
Slide 22 of 35
Memory Leaks: Detecting Growing Processes
A process with a memory leak grows its RES footprint continuously until OOM kills it or you restart it.
#!/usr/bin/env bash
# memory-trend.sh -- track process RES growth over time
set -euo pipefail
PID="${1:?Usage: $0 <PID>}"
INTERVAL=10
echo "Tracking PID $PID every ${INTERVAL}s (Ctrl+C to stop)"
while true; do
    [[ -d /proc/"$PID" ]] || { echo "PID $PID no longer exists"; break; }
    RES="$(awk '/VmRSS/{print $2}' /proc/"${PID}"/status)"
    TS="$(date +%T)"
    echo "${TS} PID=${PID} RES=${RES}kB ($(( RES / 1024 ))MiB)"
    sleep "$INTERVAL"
done
# smem: memory usage breakdown with shared memory accounting
apt install smem
smem -p -s rss    # sort by RSS
smem --pie=rss    # pie chart by RSS
# valgrind: find memory leaks in a program (development/testing)
valgrind --leak-check=full --track-origins=yes ./myapp
# Check OOM killer history (who got killed)
journalctl -k | grep -i 'killed process\|oom'
dmesg | grep -i oom
Slide 23 of 35
OOM Killer: The Last Line of Defense
When memory is truly exhausted, the Linux OOM killer selects a process to terminate. Understand and control this mechanism.
# OOM score: higher score = more likely to be killed
cat /proc/14823/oom_score        # current OOM score (0-1000)
cat /proc/14823/oom_score_adj    # adjustment (-1000 to +1000)
# Make a critical process immune to OOM kill (score adjustment -1000)
echo -1000 > /proc/14823/oom_score_adj
# WARNING: use sparingly -- if this process leaks memory, the OOM killer
# cannot reclaim it; other processes are killed instead, and the kernel can panic if nothing killable remains
# Make a dispensable process a preferred OOM target (+1000)
echo 1000 > /proc/14823/oom_score_adj
# Via systemd service unit: set OOM adjustment for the service
# [Service]
# OOMScoreAdjust=-500    # protect, but not immune
# See what the OOM killer chose last time (-i: the kernel logs "Killed process" with a capital K)
journalctl -k | grep -i 'out of memory\|killed process' | tail -20
# Tune OOM behavior: 0=kill process, 1=panic kernel (for embedded/critical systems)
sysctl -w vm.panic_on_oom=0
# For production servers, 0 is almost always correct
Slide 24 of 35
Process Priority: nice, renice, and chrt
Control how the kernel's CPU scheduler allocates CPU time to competing processes.
# nice: start a process with adjusted priority
# Range: -20 (highest priority) to +19 (lowest priority)
# Default: 0. Only root can use negative (higher priority) values.
nice -n 19 backup.sh       # run backup at lowest priority (won't starve production)
nice -n -5 critical.sh     # higher than default (root only)
# renice: change priority of a running process
renice -n 10 -p 14823      # lower priority of PID 14823
renice -n 10 -u backup     # lower all processes for user 'backup'
# ionice: I/O scheduling priority
ionice -c3 -p 14823        # class 3 = idle (only gets I/O when no one else wants it)
ionice -c2 -n0 -p 14823    # best-effort, highest priority within class
# -c1 = realtime (exclusive, dangerous)
# -c2 = best-effort (default, -n 0-7 = within-class priority)
# -c3 = idle (only when disk is completely free)
# chrt: real-time scheduling policies (for latency-sensitive workloads)
chrt -f -p 50 14823        # SCHED_FIFO at priority 50 (root only)
chrt -p 14823              # show current scheduling policy and priority
Slide 25 of 35
cgroups: Guaranteed Resource Allocation
cgroups enforce hard resource limits on CPU, memory, I/O, and task counts. systemd uses them for every service.
# View the cgroup a process belongs to
cat /proc/14823/cgroup
# View resource limits for a systemd service via cgroups
systemctl show nginx | grep -E 'CPU|Memory|IO|Tasks'
# Set resource limits in a service unit
# /etc/systemd/system/nginx.service.d/limits.conf
[Service]
# CPU: maximum 200% (2 cores worth)
CPUQuota=200%
# Memory: hard kill at 2GB
MemoryMax=2G
# Memory: start throttling at 1.5GB
MemoryHigh=1500M
# Tasks (threads): limit to 512
TasksMax=512
# I/O: limit read to 100MB/s
IOReadBandwidthMax=/dev/sda 100M
systemctl daemon-reload
systemctl restart nginx
# View current cgroup resource usage
systemd-cgtop    # top-like view of cgroup resource usage
cat /sys/fs/cgroup/system.slice/nginx.service/memory.current
Slide 26 of 35
strace Performance: Finding Syscall Overhead
A process spending excessive time in system calls is either doing excessive I/O or poorly batching operations.
# Count and summarize syscalls with timing (-c)
strace -cp 14823    # attach and summarize until Ctrl+C
# % time seconds usecs/call calls errors syscall
# 68.23 1.234567 1234 1000 50 read
# 21.14 0.382940 382 1002 2 write
# 8.91 0.161234 161 1000 0 epoll_wait
# Top syscall by time = your optimization target
# Trace only specific syscalls that matter for performance
strace -e trace=read,write,fsync -p 14823
# Find files opened repeatedly (inefficient caching)
strace -e trace=open,openat -p 14823 2>&1 | \
grep '^openat' | awk -F'"' '{print $2}' | \
sort | uniq -c | sort -rn | head -10
# Trace a command and capture all timing data
strace -T -o /tmp/strace.log ./my-slow-command
# -T appends the time spent in each syscall as <seconds>; list the slowest calls with:
awk -F'[<>]' 'NF > 2 {print $(NF-1), $0}' /tmp/strace.log | sort -rn | head -20
Slide 27 of 35
lsof: Open Files and File Descriptor Leaks
A process that opens files without closing them will eventually hit its per-process file descriptor limit, causing EMFILE errors (the system-wide limit produces ENFILE instead).
# List all open files for a process
lsof -p 14823
# Count open file descriptors for a process
ls /proc/14823/fd | wc -l
# System-wide fd limit
cat /proc/sys/fs/file-max    # max total open fds across all processes
cat /proc/sys/fs/file-nr     # allocated, allocated-but-unused, max
# Per-process limit (ulimit)
ulimit -n     # current process fd limit (soft)
ulimit -Hn    # hard limit
# Find the process with the most open fds (run as root to see all processes)
for pid in /proc/[0-9]*/fd; do
    COUNT="$(ls "$pid" 2>/dev/null | wc -l)"
    NAME="$(cat "${pid%/fd}/comm" 2>/dev/null || echo unknown)"
    echo "$COUNT $NAME ${pid%/fd}"
done 2>/dev/null | sort -rn | head -10
# Increase fd limit for a service in systemd unit
# [Service]
# LimitNOFILE=65536
Slide 28 of 35 | Applied Workflow
Incident Workflow: 60-Second Triage
A repeatable command sequence that produces a complete system health snapshot in under 60 seconds.
#!/usr/bin/env bash
# triage.sh -- 60-second system health snapshot
# No 'set -e' here: triage should continue when one tool is missing, and
# 'ps | head' would abort the script via SIGPIPE under 'set -e' with pipefail.
set -uo pipefail
echo "=== SYSTEM OVERVIEW ==="
uptime; hostname; date; nproc
echo "=== CPU (top 5 consumers) ==="
ps aux --sort=-%cpu | head -6
echo "=== MEMORY ==="
free -h
ps aux --sort=-%mem | head -6
echo "=== DISK ==="
df -h | grep -v tmpfs
iostat -xz 1 2 2>/dev/null | tail -10
echo "=== NETWORK ==="
ss -s
echo "Established TCP: $(ss -H -t state established | wc -l)"
echo "=== LOAD AVERAGE TREND ==="
# sar takes an absolute start time (-s hh:mm:ss), so compute the clock time 30 minutes ago
sar -q -s "$(date -d '30 minutes ago' +%H:%M:%S)" 2>/dev/null | head -15 || cat /proc/loadavg
echo "=== RECENT ERRORS ==="
journalctl -p err..alert --since "1 hour ago" -n 10 --no-pager
Slide 29 of 35
NUMA: Non-Uniform Memory Access
On multi-socket systems, memory access latency depends on which NUMA node holds the data. NUMA-aware allocation is a significant performance factor.
# Check if system has NUMA topology
numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 16384 MB
# node 0 free: 4096 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 16384 MB
# node 1 free: 6144 MB
# node distances: node 0 1
# node 0: 10 21 <-- node 0 to node 1 is 2.1x slower
# Run a process with memory bound to node 0
numactl --membind=0 --cpunodebind=0 ./database-server
# Check NUMA statistics for memory allocation patterns
numastat
numastat -p 14823    # NUMA memory usage for specific process
# NUMA memory pressure: check for local vs remote allocation ratio
numastat | awk '/numa_miss|numa_foreign/{print $0}'
# High numa_miss = process is allocating from wrong NUMA node
Slide 30 of 35
Huge Pages: Reducing TLB Pressure
For processes with large working sets (databases, JVMs), huge pages reduce TLB misses and improve throughput.
# Check current huge page configuration
cat /proc/meminfo | grep -i huge
# HugePages_Total: 128
# HugePages_Free: 64
# HugePages_Rsvd: 32
# HugePages_Surp: 0
# Hugepagesize: 2048 kB (2MB per page)
# AnonHugePages: 524288 kB (Transparent Huge Pages in use)
# Transparent Huge Pages (THP) -- automatic, no config needed
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never <-- "always" means THP is active
# For databases (postgres, mongodb): THP 'always' can cause latency spikes
# Switch to 'madvise' so only explicitly-requesting processes get huge pages
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# Static huge pages for performance-critical applications
sysctl -w vm.nr_hugepages=256    # request 256 x 2MB = 512MB of huge pages
# Verify allocation: HugePages_Total may come up short of 256 if memory is fragmented
grep HugePages_ /proc/meminfo
Slide 31 of 35
Baseline Collection: Know Normal Before Investigating Abnormal
Without a baseline, you cannot distinguish a performance problem from normal operation. Collect baselines continuously.
#!/usr/bin/env bash
# baseline-snapshot.sh -- collect a performance snapshot
set -euo pipefail
TS="$(date +%Y%m%d-%H%M)"
OUT="/var/lib/perf-baseline/${TS}"
mkdir -p "$OUT"
uptime > "${OUT}/uptime.txt"
free -m > "${OUT}/memory.txt"
vmstat -s > "${OUT}/vmstat-s.txt"
iostat -xz 1 5 > "${OUT}/iostat.txt"
sar -u -r -b -q 1 5 > "${OUT}/sar.txt"
ss -s > "${OUT}/ss-summary.txt"
ps aux --sort=-%cpu > "${OUT}/ps-cpu.txt"
df -h > "${OUT}/df.txt"
# Retain 7 days of snapshots (-mindepth 1 protects the base directory itself)
find /var/lib/perf-baseline -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
echo "Snapshot saved to $OUT"
# Schedule this every 15 minutes via systemd timer or cron
Slide 32 of 35
Benchmarking: Measuring Subsystem Capacity
Knowing the maximum throughput of each subsystem lets you predict bottlenecks before they appear in production.
# CPU: simple POSIX benchmark with time
time dd if=/dev/zero bs=1M count=1024 | md5sum
# Disk sequential write throughput
dd if=/dev/zero of=/tmp/benchmark.bin bs=1M count=4096 oflag=direct
# direct = bypass page cache (raw disk speed)
# Disk sequential read throughput
dd if=/tmp/benchmark.bin of=/dev/null bs=1M iflag=direct
# Random I/O with fio (install: apt install fio)
# --direct=1 bypasses the page cache; libaio only queues asynchronously with unbuffered I/O
fio --name=randread --ioengine=libaio --iodepth=32 --direct=1 \
--rw=randread --bs=4k --size=1G --numjobs=4 \
--runtime=30 --time_based --filename=/tmp/fio-test
# Network throughput between two nodes (install iperf3)
# Server: iperf3 -s
# Client:
iperf3 -c backup-node -t 10 -P 4    # 4 parallel streams, 10 seconds
# Memory bandwidth with mbw (install: apt install mbw)
mbw 512    # test with 512MB array
Slide 33 of 35
Capacity Planning: Trending Toward Failure
React to problems as they develop, not after they cause incidents. Trend analysis from sar data provides the lead time.
#!/usr/bin/env bash
# capacity-trend.sh -- extract weekly averages from sysstat data
set -euo pipefail
echo "Weekly averages from sar data"
echo "CPU Average Utilization (last 7 days):"
for day in $(seq 0 6); do
    DATE="$(date -d "$day days ago" +%d)"
    FILE="/var/log/sysstat/sa${DATE}"
    [[ -f "$FILE" ]] || continue
    # %idle is the last column of the Average line
    AVG="$(LC_ALL=C sar -u -f "$FILE" | awk '/^Average/{print 100-$NF"%"}')"
    echo "  $(date -d "$day days ago" +%Y-%m-%d): CPU used $AVG"
done
echo "Memory high-water mark (last 7 days):"
for day in $(seq 0 6); do
    DATE="$(date -d "$day days ago" +%d)"
    FILE="/var/log/sysstat/sa${DATE}"
    [[ -f "$FILE" ]] || continue
    # Locate %memused by header name: its position differs between sysstat versions
    PEAK="$(LC_ALL=C sar -r -f "$FILE" | awk '
        /%memused/ { for (i = 1; i <= NF; i++) if ($i == "%memused") col = i; next }
        col && $1 != "Average:" && $col + 0 > max { max = $col + 0 }
        END { printf "%.1f%%", max }')"
    echo "  $(date -d "$day days ago" +%Y-%m-%d): peak RAM used $PEAK"
done
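To turn daily figures like these into lead time, one option is a straight-line fit over the daily values. The sketch below is an assumption about how trending might be done, not part of the script above: days_to_limit is a hypothetical helper that reads one percentage per line (oldest first, one per day), fits a least-squares line, and estimates the days remaining until the fitted line reaches a limit.

```shell
#!/usr/bin/env bash
# Sketch: estimate days until a utilization trend reaches a limit.
# days_to_limit is an illustrative name. Input: one percentage per line,
# oldest first, one value per day. Argument: the limit in percent.
days_to_limit() {
    awk -v limit="$1" '
        { x = NR - 1; y = $1; n++; sx += x; sy += y; sxy += x * y; sxx += x * x }
        END {
            if (n < 2) { print "need at least 2 samples"; exit }
            slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # points per day
            if (slope <= 0) { print "no upward trend"; exit }
            last = (sy - slope * sx) / n + slope * (n - 1)      # fitted value for the latest day
            printf "%.0f\n", (limit - last) / slope
        }'
}

# Illustrative input (not measured data): peak memory growing 2 points per day.
printf '%s\n' 50 52 54 56 58 | days_to_limit 90
```

With the illustrative input the fitted line gains 2 points per day from 58%, so the estimate is 16 days to reach 90%. A linear fit is a deliberately simple choice; real growth is often stepwise or seasonal, so treat the number as a prompt to investigate, not a forecast.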
Slide 34 of 35 | Applied Script
Applied Script: Automated Performance Report
A weekly automated performance summary emailed to ops -- no manual review required for normal operation.
You are no longer guessing when a system runs slow. You have a methodology, a toolkit, and the mental model to identify whether the bottleneck is CPU, memory, I/O, or network -- and the commands to quantify it in under 60 seconds.
9 Facts to Carry Out of This Lecture
1. Load average above your core count is worth investigating. nproc tells you the count. Always normalize load to core count before interpreting it.
2. free: the "available" column is what matters -- not "free". Linux fills free RAM with disk cache. High "used" + high "buff/cache" is normal and desirable.
3. vmstat: watch "b" column (D-state processes) and "si/so" (swap in/out). Sustained non-zero si/so means you are actively swapping: find the memory consumer or add RAM.
4. iostat -xz: %util > 80% and await > 10ms (HDD) indicates disk saturation. Use iotop to find which process is responsible.
5. sar is the only tool that covers the past. Enable sysstat and let it collect data continuously. You will need it at 08:00 for a 03:00 incident.
6. /proc and /sys are the source of truth. Every performance tool reads from them. Reading them directly is always an option when tools are unavailable.
7. sysctl vm.swappiness=10 for servers. Reduces swap usage, keeps working sets in RAM. Persist in /etc/sysctl.d/.
8. nice -n 19 for background jobs. They will not starve interactive or production workloads. Combine with ionice -c3 for I/O isolation.
9. Measure before tuning. A tuning change that improves a metric you do not have a baseline for may be making things worse. Capture baselines first, then compare.