Linux Performance Analysis | Advanced Linux Administration

Slide 1 of 35  |  ALA-08  |  Week 4 of 8
Linux Performance Analysis
top  •  htop  •  iostat  •  vmstat  •  sar  •  free  •  /proc & /sys  •  Load Average
A sector node is running slow. Response times are up, users are complaining, and the on-call page just fired. You have sixty seconds to determine whether the bottleneck is CPU, memory, disk I/O, or network before the incident commander asks you for an initial assessment. This lecture gives you the tools and the mental model.
35 Slides ALA-08 Week 4 of 8 Ubuntu 22.04 LTS
Slide 2 of 35
The Performance Investigation Model
Start broad, narrow fast. Do not tune before you have measured.
[Diagram: 1. Observe (top / htop) -> 2. Isolate (CPU | MEM | IO | NET) -> 3. Quantify (baseline + sar) -> Fix (tune or scale)]
Step 1: Observe the Whole System
top or htop for a 30-second snapshot. What is CPU doing? Is memory full? What is the load average? What processes are consuming the most? This takes 30 seconds and often identifies the bottleneck category immediately.
Step 2: Isolate the Subsystem
Based on Step 1: high CPU leads to perf and mpstat. High I/O wait leads to iostat and iotop. Memory exhaustion leads to free, vmstat, and /proc/meminfo. Network saturation leads to sar -n DEV and ss.
Step 3: Quantify and Act
Every measurement needs a baseline to be meaningful. A load average of 8 on a 4-core machine is different from a load of 8 on a 64-core machine. Know your system's normal before diagnosing abnormal. sar provides historical baselines.
Brendan Gregg's USE Method
For every resource (CPU, memory, disk, network): check Utilization, Saturation, and Errors. A resource at 90% utilization with a saturated queue and error counters ticking is a clear bottleneck. Resources with low utilization and no saturation are not your problem.
Slide 3 of 35
Load Average: The Most Misread Metric
Load average measures the demand on the CPU scheduler. Understanding it requires knowing your CPU count.
[Diagram: load gauge for a 4-core system. 0-2 light, 2-4 normal, 4-6 heavy, 6+ saturated. Example reading: 2.34 1.87 1.42]
What It Measures
Linux load average = running processes + processes waiting in the run queue + processes in uninterruptible sleep (D state, usually I/O wait). Because D-state tasks are counted, a high load can come from disk waits as well as CPU demand. If the load is all runnable tasks, 1.0 on a single-core system means the CPU is exactly saturated. The same load on an 8-core system means it is 12.5% utilized.
Three Numbers
Load average is reported over three periods: 1 minute, 5 minutes, and 15 minutes. 1.23 0.87 0.54 means load increased recently (1-min is highest). 0.54 0.87 1.23 means load is decreasing. Compare the three to understand the trend direction.
When Is Load Too High?
Rule of thumb: sustained load average above the number of CPU cores indicates saturation. nproc gives you the core count. A load of 4.0 on a 4-core machine is worth investigating. A load of 4.0 on a 32-core machine is very light. Always normalize.
# Read load average
cat /proc/loadavg
# 2.34 1.87 1.42 3/287 14823
# 2.34 = 1-min avg   1.87 = 5-min avg   1.42 = 15-min avg
# 3/287 = 3 running / 287 total threads   14823 = last created PID
nproc    # number of logical CPUs available
uptime   # uptime + load averages in one line
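The "always normalize" rule and the three-number trend can be reduced to two small helpers. This is a minimal sketch: the function names load_per_core and load_trend are hypothetical, and the "rising / falling / mixed" labels are an illustrative simplification, not output of any standard tool.

```shell
# Hypothetical helpers: names are illustrative, not part of any standard tool.

# load_per_core LOAD CORES -> load divided by core count (3 decimals)
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.3f\n", load / cores }'
}

# load_trend ONE FIVE FIFTEEN -> rising | falling | mixed
load_trend() {
    awk -v a="$1" -v b="$2" -v c="$3" 'BEGIN {
        if (a > b && b > c)      print "rising"
        else if (a < b && b < c) print "falling"
        else                     print "mixed"
    }'
}

load_per_core 8 4            # 2.000: twice as much demand as CPUs
load_per_core 8 64           # 0.125: very light
load_trend 1.23 0.87 0.54    # rising
load_trend 0.54 0.87 1.23    # falling

# Live reading (assumes Linux /proc/loadavg and nproc are available)
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one five fifteen rest < /proc/loadavg
    echo "1-min load per core: $(load_per_core "$one" "$(nproc)") ($(load_trend "$one" "$five" "$fifteen"))"
fi
```

A per-core value that stays above 1.0 is the same signal as "load above the core count" in the rule of thumb.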
Slide 4 of 35
top: Real-Time Process Monitor
The universal first-response tool. Learn the header fields and interactive controls cold.
[Diagram: top summary with real-time refresh. CPU bar: 23.4% us, sy, wa, 68.2% idle. MEM bar: 18GB used, 13GB buff/cache]
# top header breakdown:
top - 14:32:01 up 12 days, 3:14, 2 users, load average: 2.34, 1.87, 1.42
Tasks: 287 total, 3 running, 284 sleeping, 0 stopped, 0 zombie
%Cpu(s): 23.4 us, 5.1 sy, 0.0 ni, 68.2 id, 3.3 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 32768.0 total, 1024.0 free, 18432.0 used, 13312.0 buff/cache
MiB Swap: 8192.0 total, 7680.0 free, 512.0 used. 14336.0 avail Mem
# CPU line fields:
# us = user space   sy = kernel   ni = nice   id = idle
# wa = I/O wait   hi = hardware interrupts   si = software interrupts
# st = steal (VM hypervisor taking CPU time from this VM)
# High wa% = I/O bottleneck
# High sy% = excessive system calls or kernel work
# High st% = noisy neighbor on hypervisor -- contact cloud provider
# Interactive keys (while top is running):
# 1 - toggle per-CPU breakdown
# M - sort by memory usage (RES)
# P - sort by CPU usage (default)
# k - kill a process by PID
# u - filter by username
# f - manage column fields
# W - write current settings to ~/.toprc
Slide 5 of 35
top Process Fields: Reading the Table
Every column in top has a specific meaning. These are the ones you must know in an incident.
VIRT vs RES vs SHR
VIRT total virtual memory requested (includes mapped but unused). RES resident set size -- physical RAM actually used. SHR shared memory (shared libraries, etc.). For memory pressure, focus on RES. VIRT is usually misleadingly large.
%CPU
Percentage of a single CPU core. On an 8-core system, a process can show 800% if it uses all 8 cores. Values above 100% are normal for multithreaded processes. Divide by core count to get normalized utilization.
S (Process State)
R running or runnable (in run queue). S sleeping (waiting for event, normal). D uninterruptible sleep (I/O wait -- cannot be killed). Z zombie (exited but not reaped). T stopped (SIGSTOP or trace). Many D-state processes = I/O saturation.
# Useful top command-line options
top -bn1        # batch mode, 1 iteration: non-interactive output (for scripts)
top -p 14823    # monitor specific PID only
top -u nginx    # show only processes for user 'nginx'
top -d 0.5      # update every 0.5 seconds (faster refresh)
# Script-friendly: top 5 CPU consumers
top -bn1 | awk 'NR>7{print}' | head -5
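The "many D-state processes" check does not need an interactive top session. The sketch below tallies process states from the first letter of the STAT column; count_states is a hypothetical helper name, and the sample input is made up to stand in for real ps output.

```shell
# count_states: read one state string per line (as printed by `ps -eo stat=`)
# and print "<state letter> <count>", sorted by letter.
count_states() {
    awk '{ n[substr($1, 1, 1)]++ } END { for (s in n) print s, n[s] }' | sort
}

# Made-up sample standing in for real ps output
printf '%s\n' Ss S R+ D D Sl Z | count_states

# Live use on a Linux host:
#   ps -eo stat= | count_states
```

A D count that stays well above zero across several runs points toward the I/O tools on the following slides.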
Slide 6 of 35
htop: The Ergonomic Alternative
htop provides color-coded meters, tree views, and mouse support while showing the same data as top.
[Diagram: htop tree view. systemd (1) > sshd (820) > bash (3201); nginx (904); python3 (5421). Per-CPU meters for CPU 1 to CPU 4]
Key Advantages Over top
Per-CPU bars are shown graphically. The memory bar distinguishes used, buffers, and cache. Tree view (F5 or t) shows parent-child relationships. F6 sorts by any column. F4 filters by string. Mouse clicks select processes and activate the function-key bar.
htop Color Coding
CPU bar: green = user, blue = low-priority, red = kernel, yellow = IRQ. Memory bar: green = used, blue = buffers, yellow = cache. A large blue and yellow share on the memory bar is healthy -- it means the OS is using available RAM for buffers and disk caching, which is normal and desirable.
# Install if not present
apt install htop
# Launch htop
htop
# Key bindings to know:
# F2      - setup: configure columns, meters, color schemes
# F3 or / - search for process by name
# F4      - filter: show only matching processes
# F5 or t - toggle tree view (parent-child relationships)
# F6      - sort by selected column
# u       - show processes for a specific user
# k       - send signal to selected process
# l       - show open files for selected process (lsof)
# s       - strace selected process
# i       - set I/O priority (ionice) for selected process
# H       - toggle display of user threads (K toggles kernel threads)
Slide 7 of 35
vmstat: Virtual Memory and CPU Scheduler Statistics
vmstat provides a dense, time-series view of memory, swap, I/O, and CPU in a single compact table.
# vmstat 1 5: 1-second intervals, 5 samples
vmstat 1 5
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
#  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
#  2  0      0  98304   2048 524288    0    0    12    48  420  890  8  2 89  1  0
#  1  0      0  97280   2048 524288    0    0     0   144  380  750  6  1 92  1  0
# Note: the first sample reports averages since boot; read from the second line on.
# Column breakdown:
# r = runnable processes (running or waiting for a CPU)
# b = processes in uninterruptible sleep (I/O wait)
# swpd = virtual memory in swap (should be near 0)
# si/so = swap in/out per second (non-zero = memory pressure)
# bi/bo = blocks in/out per second (disk reads/writes)
# in = interrupts per second
# cs = context switches per second (high cs = scheduler overhead)
# us/sy/id/wa = same as top CPU percentages
# Warning signs:
# r consistently > CPU count = CPU saturation
# b consistently > 0 = I/O bottleneck
# si/so non-zero = swapping (memory exhaustion)
# wa consistently > 20% = disk I/O bottleneck
# Disk statistics mode
vmstat -d 1 3   # per-disk read/write stats
vmstat -s       # event summary (total context switches, interrupts, etc.)
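The warning signs above can be applied mechanically to vmstat output. The following is a sketch under stated assumptions: vmstat_flags is a hypothetical helper, it assumes the default 17-column layout shown on this slide, and it flags single samples, whereas the rule asks for values that are consistently high. The second sample row is invented to trigger every check.

```shell
# vmstat_flags NCPU: read `vmstat` output on stdin and print one line per
# sample that trips any of the slide's warning signs.
# Assumes the default column order: r b swpd free buff cache si so bi bo in cs us sy id wa st
vmstat_flags() {
    awk -v ncpu="$1" '
        $1 !~ /^[0-9]+$/ { next }              # skip the two header lines
        {
            n++
            msg = ""
            if ($1 > ncpu)        msg = msg " cpu-saturated(r=" $1 ")"
            if ($2 > 0)           msg = msg " io-blocked(b=" $2 ")"
            if ($7 > 0 || $8 > 0) msg = msg " swapping(si=" $7 ",so=" $8 ")"
            if ($16 > 20)         msg = msg " high-iowait(wa=" $16 "%)"
            if (msg != "") print "sample " n ":" msg
        }'
}

# Sample input: the first data row is from the slide, the second is invented.
printf '%s\n' \
    'procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----' \
    ' r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st' \
    ' 2  0      0  98304   2048 524288    0    0    12    48  420  890  8  2 89  1  0' \
    ' 9  3   4096  12288   2048 524288   40  120   900  2400  980 4100 30 10 25 35  0' \
    | vmstat_flags 4

# Live use: vmstat 1 10 | vmstat_flags "$(nproc)"
```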
Slide 8 of 35
free: Memory Usage Analysis
The most important column in free is "available" -- not "free". Linux uses all available RAM for caching, which is correct behavior.
[Diagram: memory hierarchy from fast to slow. Registers (<1 ns), L1/L2/L3 cache (1-10 ns), RAM 32GB (~100 ns, "available"), swap 8GB (~10 ms)]
free -h
#        total   used    free    shared  buff/cache  available
# Mem:   31Gi    18Gi    1.0Gi   512Mi   12Gi        13Gi
# Swap:  8.0Gi   500Mi   7.5Gi
# Field meanings:
# total      - installed physical RAM
# used       - RAM used by applications (not including cache)
# free       - completely unused RAM (almost always small -- this is fine)
# shared     - tmpfs, shared memory segments
# buff/cache - OS disk cache + buffer cache (this is good: can be reclaimed)
# available  - how much RAM a new process can actually use (free + reclaimable cache)
# DO NOT PANIC about "free" being small.
# The OS fills free RAM with disk cache to speed up reads.
# "available" is the real answer to "how much memory can I use?"
# Swap analysis:
# If Swap: used > 0, applications are being paged out.
# If Swap: used is growing over time, you have a memory leak or undersized RAM.
# Monitor memory every 2 seconds
watch -n2 'free -h'
# Detailed memory breakdown
cat /proc/meminfo | head -20
Slide 9 of 35
/proc/meminfo: Kernel Memory Accounting
The raw source of all memory statistics. Understanding the key fields lets you diagnose OOM conditions before they happen.
# Key fields from /proc/meminfo
cat /proc/meminfo
# MemTotal:        33554432 kB - installed RAM
# MemFree:          1048576 kB - completely unused
# MemAvailable:    13631488 kB - available for new allocations
# Buffers:           204800 kB - kernel I/O buffer cache (block devices)
# Cached:          12582912 kB - page cache (files read from disk)
# SwapCached:             0 kB - swap content also in RAM (double counted)
# Active:          18350080 kB - recently used, not easily reclaimed
# Inactive:         9175040 kB - not recently used, can be reclaimed
# Slab:             2097152 kB - kernel slab allocator (dentry/inode cache)
# SReclaimable:     1572864 kB - slab that can be freed under pressure
# SUnreclaim:        524288 kB - slab that cannot be freed
# VmallocTotal:    very large  - virtual address space for kernel
# HugePages_Total:        0    - 2MB huge pages configured
# DirectMap2M:     32505856 kB - direct mapped memory using 2MB pages
# Script: check if OOM is imminent
AVAIL=$(awk '/MemAvailable/{print $2}' /proc/meminfo)
TOTAL=$(awk '/MemTotal/{print $2}' /proc/meminfo)
PCT=$(( 100 - (AVAIL * 100 / TOTAL) ))
(( PCT > 90 )) && echo "CRITICAL: Memory ${PCT}% used -- OOM risk"
Slide 10 of 35
iostat: Disk I/O Statistics
iostat measures disk throughput, I/O operations per second, and wait times. The first tool to reach for when top shows high wa%.
[Diagram: I/O stack. Application > VFS > Block Layer > I/O Scheduler > Device Driver > sda / nvme0n1. iostat reports r/s, w/s, await (ms), %util, aqu-sz]
# iostat with extended statistics, 2-second intervals, 5 samples
iostat -xz 2 5
# Device:  r/s   w/s   rkB/s   wkB/s    rrqm/s  wrqm/s  %util  await  r_await  w_await  svctm  aqu-sz
# sda:     10.2  45.3  4096.0  18432.0  0.5     12.3    68.4   8.23   2.10     9.50     1.82   0.37
# Key fields:
# r/s    - read operations per second
# w/s    - write operations per second
# rkB/s  - read throughput (KB/s)
# wkB/s  - write throughput (KB/s)
# await  - average wait time per I/O request (ms) -- KEY METRIC
# r_await / w_await - separate read and write latencies
# %util  - how busy the device is (100% = saturated)
# aqu-sz - average queue depth (>1 = requests piling up)
# Warning signs:
# %util > 80% consistently = disk saturation
# await > 10ms for HDD, > 1ms for NVMe SSD = slow I/O
# aqu-sz > 1 = queue building up (worse than util alone)
# CPU I/O wait from iostat
iostat -c 1       # CPU stats only: us sy ni id wa steal
iostat -d sda 1   # single device: sda
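Column order in iostat -x output differs between sysstat versions, so a script that applies the warning signs is safer if it finds columns by header name instead of by position. The sketch below does that for %util and aqu-sz; iostat_flags is a hypothetical helper, the thresholds are the ones from this slide, and the nvme0n1 row is invented sample data.

```shell
# iostat_flags: read `iostat -x` output on stdin and print devices whose
# %util exceeds 80 or whose aqu-sz exceeds 1. Columns are located by name.
iostat_flags() {
    awk '
        NF == 0        { in_dev = 0; next }    # a blank line ends a device table
        $1 ~ /^Device/ {                       # header row: remember column positions
            for (i = 1; i <= NF; i++) col[$i] = i
            in_dev = 1
            next
        }
        in_dev && ("%util" in col) {
            util = $(col["%util"]) + 0
            q = ("aqu-sz" in col) ? $(col["aqu-sz"]) + 0 : 0
            if (util > 80 || q > 1)
                printf "%s util=%.1f%% aqu-sz=%.2f\n", $1, util, q
        }'
}

# Sample input (sda values from the slide, nvme0n1 values invented)
printf '%s\n' \
    'Device r/s w/s rkB/s wkB/s await aqu-sz %util' \
    'sda 10.2 45.3 4096.0 18432.0 8.23 0.37 68.4' \
    'nvme0n1 900.0 1200.0 51200.0 76800.0 4.10 6.20 97.5' \
    | iostat_flags

# Live use: iostat -xz 2 5 | iostat_flags
```

As with vmstat, a single flagged interval is a hint; the slide's rule is about values that stay high.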
Slide 11 of 35
iotop: Per-Process I/O Monitor
iostat shows disk totals. iotop shows which specific process is responsible for the I/O load.
# Install iotop
apt install iotop
# Run iotop (requires root or CAP_NET_ADMIN)
iotop          # interactive, sorted by I/O rate
iotop -o       # --only: show only processes with active I/O (cleaner)
iotop -b -n5   # batch mode, 5 iterations (scriptable)
# Output format:
# TID   PRIO  USER   DISK READ  DISK WRITE  SWAPIN  IO     COMMAND
# 8342  be/4  mysql  0.00 B/s   18.42 M/s   0.00%   12.3%  mysqld
# Script: find top I/O consumer and log it
iotop -b -n2 -q 2>&1 | \
    awk 'NR>2 && ($4+0 > 1000 || $6+0 > 1000) {print $0}' | \
    head -5 | \
    logger -t iotop-alert
# Alternative: /proc/PID/io (per-process I/O without iotop)
cat /proc/14823/io
# rchar: 1048576   (bytes read via read() calls)
# wchar: 524288    (bytes written via write() calls)
# syscr: 256       (number of read() syscalls)
# syscw: 128       (number of write() syscalls)
Slide 12 of 35
sar: System Activity Reporter
sar collects and reports historical system performance data. The only tool that lets you investigate a performance issue that happened last night.
[Diagram: sa1 (cron, every 10 min) -> /var/log/sysstat (28 days retained) -> sar -u / -r (query) -> historical report for CPU / MEM / IO / NET]
Historical Analysis
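sar records CPU, memory, I/O, and network statistics every 10 minutes by default (via a scheduled job that calls sa1). This data is stored in /var/log/sysstat/ and kept for the number of days set by HISTORY in /etc/sysstat/sysstat (28 days here; the packaged default may be shorter, so check it). When an incident happens at 03:00, sar has the data you need at 08:00.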
Real-Time Mode
sar also works like vmstat: sar -u 1 10 shows CPU utilization every second for 10 samples. This makes sar a single tool that covers both real-time investigation and historical review.
Installation
Part of the sysstat package. On Ubuntu, after installing, enable collection in /etc/default/sysstat by setting ENABLED="true". The collection cron job then starts recording automatically.
# Install and enable sysstat
apt install sysstat
sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
systemctl enable --now sysstat
# Verify data collection is running
ls -la /var/log/sysstat/
Slide 13 of 35
sar: CPU and Load History
Pull yesterday's CPU utilization history to determine when the performance problem began.
# CPU utilization from today's data file
sar -u                            # today's CPU history (10-min intervals)
sar -u 1 5                        # live: 1-second intervals, 5 samples
sar -u -f /var/log/sysstat/sa08   # data file for day 8 of the month (e.g. April 8)
# Per-CPU breakdown
sar -P ALL 1 5       # all CPUs individually
sar -P 0,1,2,3 1 3   # specific CPUs 0-3
# Context switches and interrupts
sar -w 1 5       # context switches per second
sar -I ALL 1 3   # interrupt rates by interrupt number
# Load average history
sar -q           # run queue and load average
# Find the peak load window in yesterday's data
LC_ALL=C sar -q -f /var/log/sysstat/sa08 | sort -k2 -rn | head -5
# LC_ALL=C gives 24-hour timestamps, so runq-sz is column 2 (ldavg-1 is column 4)
Slide 14 of 35
sar: Memory and Disk History
Correlate the memory utilization timeline with the CPU timeline to pinpoint when the system came under stress.
# Memory utilization history
sar -r       # memory: free, used, cached, swpd
sar -r ALL   # extended: includes huge pages, slab, etc.
# Swap usage history
sar -S       # swap: total, used, free
# Disk I/O history
sar -b       # I/O: tps, rtps, wtps, bread/s, bwrtn/s
sar -d       # per-device: %util, await, tps
sar -d -f /var/log/sysstat/sa08 | grep sda
# Network history
sar -n DEV    # per-interface: rx/tx packets and bytes
sar -n EDEV   # network errors
sar -n TCP    # TCP segments, connection rates
# Complete incident review: all subsystems yesterday 02:00 to 04:00
for flags in "-u" "-r" "-b" "-n DEV"; do
    echo "=== sar $flags ==="
    # $flags is deliberately unquoted so "-n DEV" splits into two arguments
    sar $flags -f /var/log/sysstat/sa08 -s 02:00:00 -e 04:00:00
done
Slide 15 of 35
/proc: The Kernel's Live Data Export
/proc is a virtual filesystem. Every file in it is a live view into kernel state. Most performance tools read from it.
System-Wide /proc Files
/proc/loadavg load averages. /proc/meminfo detailed memory. /proc/stat CPU stats since boot. /proc/diskstats disk I/O counters. /proc/net/dev network interface stats. /proc/sys/ tunable kernel parameters.
Per-Process /proc/PID/
status name, state, memory. cmdline full command. fd/ open file descriptors. maps memory map. io I/O counters. net/ network sockets. cgroup cgroup membership. oom_score OOM killer priority.
/proc/sys Tunables
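Read and write kernel parameters in real time. Changes take effect immediately but reset on reboot. Make permanent via /etc/sysctl.conf or files in /etc/sysctl.d/. Apply without reboot: sysctl -p reloads /etc/sysctl.conf (or a file you name), and sysctl --system reloads every configuration file.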
# Examples of reading /proc directly
cat /proc/stat | head -5                # raw CPU tick counters per CPU
cat /proc/diskstats                     # raw disk I/O counters (basis for iostat)
cat /proc/net/dev                       # raw network packet/byte counters
cat /proc/14823/status                  # specific process memory and state
cat /proc/14823/cmdline | tr '\0' ' '   # full command line (null-separated)
ls -la /proc/14823/fd/                  # all open file descriptors
cat /proc/14823/oom_score               # OOM killer score (higher = killed first)
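To see how tools turn the raw tick counters in /proc/stat into percentages, take two samples of the aggregate "cpu" line and compare the deltas. This is a minimal sketch, not how top is actually implemented: cpu_busy_pct is a hypothetical helper, it counts idle plus iowait as "not busy", and it sums only the first eight counters (user through steal) because guest time is already included in user time.

```shell
# cpu_busy_pct "LINE1" "LINE2": two aggregate "cpu ..." lines from /proc/stat,
# taken some interval apart. Prints the busy percentage over that interval.
# Fields: cpu user nice system idle iowait irq softirq steal guest guest_nice
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i   # user..steal
            idle = $5 + $6                                    # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle; next }
            dt = total - t0
            di = idle - i0
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (dt - di) / dt
        }'
}

# Synthetic samples: 1000 ticks elapsed, 750 of them idle or iowait
cpu_busy_pct 'cpu 100 0 100 700 100 0 0 0 0 0' \
             'cpu 300 0 150 1400 150 0 0 0 0 0'    # 25.0

# Live use on Linux
if [ -r /proc/stat ]; then
    a=$(grep '^cpu ' /proc/stat)
    sleep 1
    b=$(grep '^cpu ' /proc/stat)
    echo "CPU busy over 1s: $(cpu_busy_pct "$a" "$b")%"
fi
```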
Slide 16 of 35
/sys: Hardware and Driver Interface
/sys exposes hardware topology, device configuration, and kernel driver settings as a filesystem.
# CPU frequency and power
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# governors: performance, powersave, ondemand, conservative, schedutil
# Set CPU governor to performance for maximum throughput
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo "performance" > "$cpu"
done
# Block device queue tuning
cat /sys/block/sda/queue/scheduler     # current I/O scheduler
cat /sys/block/sda/queue/nr_requests   # queue depth
cat /sys/block/sda/queue/rotational    # 0 = SSD, 1 = HDD
# Set optimal scheduler (mq-deadline for SSD, bfq for HDD)
echo "mq-deadline" > /sys/block/sda/queue/scheduler
# Network device statistics and tuning
cat /sys/class/net/eth0/statistics/rx_bytes
cat /sys/class/net/eth0/statistics/rx_dropped
cat /sys/class/net/eth0/speed          # link speed in Mbps
# NUMA topology
cat /sys/devices/system/node/possible
cat /sys/devices/system/node/node0/meminfo
Slide 17 of 35
sysctl: Kernel Parameter Tuning
sysctl provides a clean interface to read and set kernel parameters that live in /proc/sys/.
# Read a parameter
sysctl vm.swappiness        # vm.swappiness = 60
sysctl net.core.somaxconn
# Set a parameter (takes effect immediately)
sysctl -w vm.swappiness=10
sysctl -w net.core.somaxconn=65536
# List all vm.* kernel parameters
sysctl -a | grep '^vm\.'
# Persist in /etc/sysctl.d/99-sector.conf
# vm.swappiness = 10
# net.core.somaxconn = 65536
# net.ipv4.tcp_max_syn_backlog = 8192
# vm.dirty_ratio = 15
# vm.dirty_background_ratio = 5
# Apply without reboot
sysctl -p /etc/sysctl.d/99-sector.conf
# Key parameters explained:
# vm.swappiness=10   - prefer keeping application pages in RAM over swapping
#                      (0 = swap only to avoid running out of memory, it does not disable swap;
#                      higher values swap more aggressively)
# vm.dirty_ratio=15  - % RAM that can hold dirty pages before writes are forced
# net.core.somaxconn - max listen() backlog (connections pending accept())
# net.ipv4.tcp_fin_timeout - how long to hold TCP FIN sockets (reduce for busy servers)
Slide 18 of 35
perf: CPU Profiling and Event Counting
perf uses hardware performance counters to profile CPU usage at the function level, revealing where exactly time is being spent.
# Install (on Ubuntu, perf ships in the linux-tools packages for the running kernel)
apt install linux-tools-common linux-tools-generic "linux-tools-$(uname -r)"
# perf stat: count hardware events for a command
perf stat ls /var/log
# Performance counter stats for 'ls /var/log':
#      124,832  cache-misses       (3.80% of all cache refs)
#    3,284,110  cache-references
#          822  page-faults
#   12,834,782  instructions       (1.12 insn per cycle)
#    1,432,440  branch-misses      (4.23% of all branches)
# perf top: live sampling profiler (like top, but shows function names)
perf top            # system-wide function-level profiling
perf top -p 14823   # profile specific process
# Record a profile and analyze it
perf record -g -p 14823 -- sleep 10   # record with call graphs for 10 seconds
perf report                           # interactive analysis
perf report --stdio                   # text output (for logging)
# Flame graph: perf + FlameGraph for visual profiling
perf record -g -p 14823 -- sleep 30
perf script | /opt/FlameGraph/stackcollapse-perf.pl | \
    /opt/FlameGraph/flamegraph.pl > /tmp/flame.svg
Slide 19 of 35
mpstat: Per-CPU Statistics
mpstat shows per-CPU utilization. Single-core saturation while others sit idle indicates a single-threaded bottleneck.
# All CPUs, 1-second intervals, 5 samples
mpstat -P ALL 1 5
# Sample output:
# CPU   %usr  %sys  %iowait  %irq  %soft  %idle
# all   18.4   3.2      8.1   0.2    0.4   69.7
#   0   72.0   4.1      0.0   0.0    0.0   23.9   <-- saturated
#   1    2.1   3.0     32.1   0.0    0.0   62.8   <-- I/O wait
#   2    3.2   2.8      0.0   0.0    0.0   94.0
#   3    1.2   2.8      0.0   0.0    0.0   96.0
# Interpretation:
# CPU 0 at 72% user = single-threaded bottleneck (offload to more threads)
# CPU 1 at 32% iowait = I/O bound process on that CPU
# Check which PID is pinned to CPU 0: taskset -p PID
# Single CPU (CPU 0 only)
mpstat -P 0 1 5
# Check CPU affinity of a process
taskset -pc 14823       # current affinity list
# Set CPU affinity to CPUs 2 and 3 only
taskset -pc 2,3 14823
Slide 20 of 35
df and du: Filesystem and Directory Space
df shows filesystem-level usage. du shows directory-level usage. Together they find what is consuming disk space.
# df: filesystem usage
df -h            # human-readable: GiB, MiB
df -hT           # include filesystem type (ext4, xfs, tmpfs)
df -i            # inode usage (can be full even if bytes are free)
df -h /var/log   # only the filesystem containing /var/log
# Disk full but df shows space: check inode exhaustion
df -i /var/log
# Filesystem  Inodes  IUsed   IFree  IUse%  Mounted on
# /dev/sda1   524288  524288  0      100%   /var/log   <-- inodes exhausted
# du: directory usage
du -sh /var/log/*              # size of each item directly under /var/log
du -sh /* 2>/dev/null          # top-level directory sizes
du -sh /var/log/* | sort -rh   # sorted by size, largest first
# Find top 20 largest files under /var
find /var -type f -printf '%s %p\n' 2>/dev/null | \
    sort -rn | \
    head -20 | \
    awk '{printf "%.1f MB\t%s\n", $1/1048576, $2}'
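Inode exhaustion is easy to miss because df -h looks healthy. A small filter over df -i output can flag it before writes start failing. This is a sketch: inode_flags is a hypothetical helper, the 90% default threshold is an assumption, and mount points containing spaces are not handled. The /dev/sdb1 row is invented sample data.

```shell
# inode_flags [THRESHOLD]: read `df -i` output on stdin and print
# "<mount point> <IUse%>" for filesystems at or above THRESHOLD percent.
inode_flags() {
    awk -v max="${1:-90}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            # Filesystems without inode accounting show "-" and are skipped
            if (use ~ /^[0-9]+$/ && use + 0 >= max) print $6, $5
        }'
}

# Sample input (first data row from the slide, second invented)
printf '%s\n' \
    'Filesystem Inodes IUsed IFree IUse% Mounted on' \
    '/dev/sda1 524288 524288 0 100% /var/log' \
    '/dev/sdb1 1048576 20000 1028576 2% /data' \
    | inode_flags 90

# Live use: df -iP | inode_flags 90
```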
Slide 21 of 35
Network Performance: ss, nethogs, and sar -n
Identify network saturation, connection storms, and which processes are consuming bandwidth.
# ss: socket statistics (replacement for netstat)
ss -s                            # summary: total sockets by state
ss -tunapl                       # TCP/UDP, numeric, all, processes, listening
ss -tp state established         # established TCP connections with process
ss -tp state time-wait | wc -l   # count TIME_WAIT (high = connection churn)
# nethogs: per-process bandwidth (like iotop for network)
apt install nethogs
nethogs eth0
# sar network history
sar -n DEV 1 5   # per-interface rx/tx rates live
sar -n TCP 1 5   # TCP segments and connection rates
sar -n SOCK      # socket counts: TCP, UDP, raw
# Check interface for errors and drops
ip -s link show eth0
ethtool -S eth0 | grep -i 'drop\|error\|miss'
# Real-time bandwidth with ifstat (simple, useful)
apt install ifstat
ifstat -i eth0 1
Slide 22 of 35
Memory Leaks: Detecting Growing Processes
A process with a memory leak grows its RES footprint continuously until OOM kills it or you restart it.
#!/usr/bin/env bash
# memory-trend.sh -- track process RES growth over time
set -euo pipefail
PID="${1:?'Usage: $0 <PID>'}"
INTERVAL=10
echo "Tracking PID $PID every ${INTERVAL}s (Ctrl+C to stop)"
while true; do
    [[ ! -d /proc/"$PID" ]] && { echo "PID $PID no longer exists"; break; }
    RES="$(awk '/VmRSS/{print $2}' /proc/"${PID}"/status)"
    TS="$(date +%T)"
    echo "${TS} PID=${PID} RES=${RES}kB ($(( RES / 1024 ))MiB)"
    sleep "$INTERVAL"
done
# smem: memory usage breakdown with shared memory accounting
apt install smem
smem -p -s rss   # sort by RSS
smem --pie=rss   # pie chart by RSS
# valgrind: find memory leaks in a program (development/testing)
valgrind --leak-check=full --track-origins=yes ./myapp
# Check OOM killer history (who got killed)
journalctl -k | grep -i 'killed process\|oom'
dmesg | grep -i oom
Slide 23 of 35
OOM Killer: The Last Line of Defense
When memory is truly exhausted, the Linux OOM killer selects a process to terminate. Understand and control this mechanism.
# OOM score: higher score = more likely to be killed
cat /proc/14823/oom_score       # current OOM score (0-1000)
cat /proc/14823/oom_score_adj   # adjustment (-1000 to +1000)
# Make a critical process immune to OOM kill (score adjustment -1000)
echo -1000 > /proc/14823/oom_score_adj
# WARNING: use sparingly -- if this process leaks memory, the OOM killer
# cannot reclaim it; it will kill other processes instead, and the kernel
# can panic once nothing killable is left
# Make a dispensable process a preferred OOM target (+1000)
echo 1000 > /proc/14823/oom_score_adj
# Via systemd service unit: set OOM adjustment for the service
# [Service]
# OOMScoreAdjust=-500   # protect, but not immune
# See what the OOM killer chose last time (-i: the kernel logs "Killed process")
journalctl -k | grep -i 'out of memory\|killed process' | tail -20
# Tune OOM behavior: 0=kill process, 1=panic kernel (for embedded/critical systems)
sysctl -w vm.panic_on_oom=0
# For production servers, 0 is almost always correct
Slide 24 of 35
Process Priority: nice, renice, and chrt
Control how the kernel's CPU scheduler allocates CPU time to competing processes.
[Diagram: nice priority scale from -20 (highest) through 0 (default) to +19 (lowest). ionice classes: RT (class 1), BE (class 2), Idle (class 3). chrt: FIFO / RR real-time policies]
# nice: start a process with adjusted priority
# Range: -20 (highest priority) to +19 (lowest priority)
# Default: 0. Only root can use negative (higher priority) values.
nice -n 19 backup.sh     # run backup at lowest priority (won't starve production)
nice -n -5 critical.sh   # higher than default (root only)
# renice: change priority of a running process
renice -n 10 -p 14823    # lower priority of PID 14823
renice -n 10 -u backup   # lower all processes for user 'backup'
# ionice: I/O scheduling priority
ionice -c3 -p 14823       # class 3 = idle (only gets I/O when no one else wants it)
ionice -c2 -n0 -p 14823   # best-effort, highest priority within class
# -c1 = realtime (exclusive, dangerous)
# -c2 = best-effort (default, -n 0-7 = within-class priority)
# -c3 = idle (only when disk is completely free)
# chrt: real-time scheduling policies (for latency-sensitive workloads)
chrt -f -p 50 14823   # SCHED_FIFO at priority 50 (root only)
chrt -p 14823         # show current scheduling policy and priority
Slide 25 of 35
cgroups: Guaranteed Resource Allocation
cgroups enforce hard resource limits: CPU, memory, I/O, and network. systemd uses them for every service.
# View the cgroup a process belongs to
cat /proc/14823/cgroup
# View resource limits for a systemd service via cgroups
systemctl show nginx | grep -E 'CPU|Memory|IO|Tasks'
# Set resource limits in a service unit
# /etc/systemd/system/nginx.service.d/limits.conf
[Service]
# CPU: maximum 200% (2 cores worth)
CPUQuota=200%
# Memory: hard kill at 2GB
MemoryMax=2G
# Memory: start throttling at 1.5GB
MemoryHigh=1500M
# Tasks (threads): limit to 512
TasksMax=512
# I/O: limit read to 100MB/s
IOReadBandwidthMax=/dev/sda 100M

systemctl daemon-reload
systemctl restart nginx
# View current cgroup resource usage
systemd-cgtop   # top-like view of cgroup resource usage
cat /sys/fs/cgroup/system.slice/nginx.service/memory.current
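To make "200% = 2 cores worth" concrete: on cgroup v2 a CPU quota is expressed as "quota period" in microseconds in the cpu.max file. The sketch below does the arithmetic, assuming a 100 ms period (which I believe is systemd's default CPUQuotaPeriodSec, but treat it as an assumption); cpu_quota_to_cpu_max is a hypothetical helper name.

```shell
# cpu_quota_to_cpu_max PERCENT [PERIOD_US]
# Convert a CPUQuota-style percentage into the "quota period" pair used by
# cgroup v2 cpu.max. PERIOD_US defaults to 100000 (100 ms, assumed default).
cpu_quota_to_cpu_max() {
    pct=${1%\%}                 # accept "200" or "200%"
    period_us=${2:-100000}
    echo "$(( pct * period_us / 100 )) $period_us"
}

cpu_quota_to_cpu_max 200%   # 200000 100000: 200 ms of CPU time per 100 ms window
cpu_quota_to_cpu_max 50     # 50000 100000: half of one core

# Compare with the live value for a service on a cgroup v2 host, for example:
#   cat /sys/fs/cgroup/system.slice/nginx.service/cpu.max
```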
Slide 26 of 35
strace Performance: Finding Syscall Overhead
A process spending excessive time in system calls is either doing excessive I/O or poorly batching operations.
# Count and summarize syscalls with timing (-c)
strace -cp 14823   # attach and summarize until Ctrl+C
# % time   seconds    usecs/call  calls  errors  syscall
# 68.23    1.234567   1234        1000   50      read
# 21.14    0.382940   382         1002   2       write
#  8.91    0.161234   161         1000   0       epoll_wait
# Top syscall by time = your optimization target
# Trace only specific syscalls that matter for performance
strace -e trace=read,write,fsync -p 14823
# Find files opened repeatedly (inefficient caching)
strace -e trace=open,openat -p 14823 2>&1 | \
    grep '^openat' | awk -F'"' '{print $2}' | \
    sort | uniq -c | sort -rn | head -10
# Trace a command and capture all timing data
strace -T -o /tmp/strace.log ./my-slow-command
# -T appends the time spent in each call as <seconds> at the end of the line.
# Pull that value to the front, then sort on it to list the slowest calls:
awk -F'[<>]' 'NF >= 3 { print $(NF-1), $0 }' /tmp/strace.log | sort -rn | head -20
Slide 27 of 35
lsof: Open Files and File Descriptor Leaks
A process that opens files without closing them will eventually hit its per-process file descriptor limit and start failing with EMFILE errors (ENFILE is the system-wide equivalent).
# List all open files for a process
lsof -p 14823
# Count open file descriptors for a process
ls /proc/14823/fd | wc -l
# System-wide fd limit
cat /proc/sys/fs/file-max   # max total open fds across all processes
cat /proc/sys/fs/file-nr    # allocated, allocated but unused, max
# Per-process limit (ulimit)
ulimit -n    # current process fd limit (soft)
ulimit -Hn   # hard limit
# Find the process with the most open fds
for pid in /proc/[0-9]*/fd; do
    COUNT="$(ls "$pid" 2>/dev/null | wc -l)"
    NAME="$(cat "${pid%/fd}/comm" 2>/dev/null || echo unknown)"
    echo "$COUNT $NAME ${pid%/fd}"
done 2>/dev/null | sort -rn | head -10
# Increase fd limit for a service in systemd unit
# [Service]
# LimitNOFILE=65536
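A raw descriptor count only matters relative to the limit the process is running under. The sketch below compares the two using /proc/PID/limits; fd_pct and fd_report are hypothetical helper names, and fd_report works only on Linux for processes whose /proc entries you are allowed to read.

```shell
# fd_pct OPEN LIMIT -> integer percentage of the limit in use
fd_pct() {
    echo $(( $1 * 100 / $2 ))
}

# fd_report PID -> "open=N soft_limit=M used=P%" (Linux /proc only)
fd_report() {
    open=$(ls "/proc/$1/fd" 2>/dev/null | wc -l)
    # Line format: "Max open files  <soft>  <hard>  files"
    soft=$(awk '/^Max open files/ { print $4 }' "/proc/$1/limits" 2>/dev/null)
    case "$soft" in
        ''|*[!0-9]*) echo "open=$open soft_limit=${soft:-unknown}" ;;
        *)           echo "open=$open soft_limit=$soft used=$(fd_pct "$open" "$soft")%" ;;
    esac
}

fd_pct 900 1024    # 87: close to the default soft limit, worth a look

# Report on the current shell if /proc is available
if [ -d "/proc/$$/fd" ]; then
    fd_report "$$"
fi
```

A percentage that climbs steadily over hours is the descriptor-leak pattern this slide describes.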
Slide 28 of 35  |  Applied Workflow
Incident Workflow: 60-Second Triage
A repeatable command sequence that produces a complete system health snapshot in under 60 seconds.
#!/usr/bin/env bash
# triage.sh -- 60-second system health snapshot
# No "set -e" here: triage should keep going when one tool is missing or fails.
set -uo pipefail
echo "=== SYSTEM OVERVIEW ==="
uptime; hostname; date; nproc
echo "=== CPU (top 5 consumers) ==="
ps aux --sort=-%cpu | head -6
echo "=== MEMORY ==="
free -h
ps aux --sort=-%mem | head -6
echo "=== DISK ==="
df -h | grep -v tmpfs
iostat -xz 1 2 2>/dev/null | tail -10
echo "=== NETWORK ==="
ss -s
# -H drops the header line so it is not counted as a connection
ss -Htn state established | wc -l | xargs -I{} echo "Established TCP: {}"
echo "=== LOAD AVERAGE TREND ==="
# sar takes an absolute start time (-s hh:mm:ss); fall back to /proc/loadavg if it fails
sar -q -s "$(date -d '30 minutes ago' +%H:%M:%S)" 2>/dev/null | head -15 || cat /proc/loadavg
echo "=== RECENT ERRORS ==="
journalctl -p err..alert --since "1 hour ago" -n 10 --no-pager
Slide 29 of 35
NUMA: Non-Uniform Memory Access
On multi-socket systems, memory access latency depends on which NUMA node holds the data. NUMA-aware allocation is a significant performance factor.
# Check if system has NUMA topology
numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 16384 MB
# node 0 free: 4096 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 16384 MB
# node 1 free: 6144 MB
# node distances:
# node   0   1
#   0:  10  21   <-- node 0 to node 1 is 2.1x slower
# Run a process with memory bound to node 0
numactl --membind=0 --cpunodebind=0 ./database-server
# Check NUMA statistics for memory allocation patterns
numastat
numastat -p 14823   # NUMA memory usage for specific process
# NUMA memory pressure: check for local vs remote allocation ratio
numastat | awk '/numa_miss|numa_foreign/{print $0}'
# High numa_miss = process is allocating from wrong NUMA node
Slide 30 of 35
Huge Pages: Reducing TLB Pressure
For processes with large working sets (databases, JVMs), huge pages reduce TLB misses and improve throughput.
# Check current huge page configuration
cat /proc/meminfo | grep -i huge
# HugePages_Total:     128
# HugePages_Free:       64
# HugePages_Rsvd:       32
# HugePages_Surp:        0
# Hugepagesize:       2048 kB   (2MB per page)
# AnonHugePages:    524288 kB   (Transparent Huge Pages in use)
# Transparent Huge Pages (THP) -- automatic, no config needed
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never   <-- "always" means THP is active
# For databases (postgres, mongodb): THP 'always' causes latency spikes
# Switch to 'madvise' so only explicitly-requesting processes get huge pages
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# Static huge pages for performance-critical applications
sysctl -w vm.nr_hugepages=256   # allocate 256 x 2MB = 512MB of huge pages
# Verify allocation (some may fail if memory is fragmented)
grep HugePages_Free /proc/meminfo
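Sizing vm.nr_hugepages is plain arithmetic on the page size reported as Hugepagesize. The helpers below work in both directions; the names hugepages_mib and hugepages_needed are hypothetical, the 2048 kB default matches the sample output above, and the 8 GiB target is an invented example.

```shell
# hugepages_mib COUNT [PAGE_KB] -> memory reserved by COUNT huge pages, in MiB
hugepages_mib() {
    echo $(( $1 * ${2:-2048} / 1024 ))
}

# hugepages_needed MIB [PAGE_KB] -> pages required to cover MIB, rounded up
hugepages_needed() {
    page_kb=${2:-2048}
    echo $(( ($1 * 1024 + page_kb - 1) / page_kb ))
}

hugepages_mib 256        # 512: matches "256 x 2MB = 512MB" above
hugepages_needed 8192    # 4096 pages for a hypothetical 8 GiB buffer pool
hugepages_needed 3       # 2: a 3 MiB request rounds up to two 2 MiB pages
```

The result of hugepages_needed is the value you would pass to sysctl -w vm.nr_hugepages, subject to the fragmentation caveat noted above.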
Slide 31 of 35
Baseline Collection: Know Normal Before Investigating Abnormal
Without a baseline, you cannot distinguish a performance problem from normal operation. Collect baselines continuously.
#!/usr/bin/env bash
# baseline-snapshot.sh -- collect a short performance snapshot (about 10 seconds of sampling)
set -euo pipefail
TS="$(date +%Y%m%d-%H%M)"
OUT="/var/lib/perf-baseline/${TS}"
mkdir -p "$OUT"
uptime > "${OUT}/uptime.txt"
free -m > "${OUT}/memory.txt"
vmstat -s > "${OUT}/vmstat-s.txt"
iostat -xz 1 5 > "${OUT}/iostat.txt"
sar -u -r -b -q 1 5 > "${OUT}/sar.txt"
ss -s > "${OUT}/ss-summary.txt"
ps aux --sort=-%cpu > "${OUT}/ps-cpu.txt"
df -h > "${OUT}/df.txt"
# Retain 7 days of snapshots (-mindepth 1 keeps find from ever matching the base directory)
find /var/lib/perf-baseline -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
echo "Snapshot saved to $OUT"
# Schedule this every 15 minutes via systemd timer or cron
Slide 32 of 35
Benchmarking: Measuring Subsystem Capacity
Knowing the maximum throughput of each subsystem lets you predict bottlenecks before they appear in production.
# CPU: simple benchmark with time and coreutils
time dd if=/dev/zero bs=1M count=1024 | md5sum

# Disk sequential write throughput
dd if=/dev/zero of=/tmp/benchmark.bin bs=1M count=4096 oflag=direct
# direct = bypass page cache (raw disk speed)

# Disk sequential read throughput
dd if=/tmp/benchmark.bin of=/dev/null bs=1M iflag=direct

# Random I/O with fio (install: apt install fio)
# --direct=1 bypasses the page cache; libaio is only asynchronous with direct I/O
fio --name=randread --ioengine=libaio --iodepth=32 --direct=1 \
    --rw=randread --bs=4k --size=1G --numjobs=4 \
    --runtime=30 --time_based --filename=/tmp/fio-test

# Remove the test files when finished
rm -f /tmp/benchmark.bin /tmp/fio-test

# Network throughput between two nodes (install iperf3)
# Server: iperf3 -s
# Client: iperf3 -c backup-node -t 10 -P 4   # 4 parallel streams, 10 seconds

# Memory bandwidth with mbw (install: apt install mbw)
mbw 512   # test with 512MB array
Slide 33 of 35
Capacity Planning: Trending Toward Failure
React to problems as they develop, not after they cause incidents. Trend analysis from sar data provides the lead time.
#!/usr/bin/env bash
# capacity-trend.sh -- extract daily averages for the past week from sysstat data
set -euo pipefail

echo "Daily averages from sar data"
echo "CPU Average Utilization (last 7 days):"
for day in $(seq 0 6); do
  DATE="$(date -d "$day days ago" +%d)"
  FILE="/var/log/sysstat/sa${DATE}"
  [[ -f "$FILE" ]] || continue
  # Average line: $8 is %idle, so 100 - %idle is total CPU used
  AVG="$(sar -u -f "$FILE" | awk '/Average/{print 100-$8"%"}')"
  echo "  $(date -d "$day days ago" +%Y-%m-%d): CPU used $AVG"
done

echo "Memory high-water mark (last 7 days):"
for day in $(seq 0 6); do
  DATE="$(date -d "$day days ago" +%d)"
  FILE="/var/log/sysstat/sa${DATE}"
  [[ -f "$FILE" ]] || continue
  # Locate the %memused column from the header: its position differs
  # between sysstat versions and between 12-hour and 24-hour time formats
  PEAK="$(sar -r -f "$FILE" | awk '
    /%memused/ { for (i = 1; i <= NF; i++) if ($i == "%memused") col = i; next }
    col && /^[0-9]/ && $col + 0 > max { max = $col + 0 }
    END { printf "%.1f%%", max }')"
  echo "  $(date -d "$day days ago" +%Y-%m-%d): peak RAM used $PEAK"
done
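The lead time that trend analysis provides can be estimated with a straight-line fit over the daily figures. The following is a rough sketch, not the lecture's method: the helper name days_until_threshold is hypothetical, it assumes input lines of the form "day_index value" with the oldest day first, and a linear trend is itself an assumption that real workloads may not follow.

```shell
#!/usr/bin/env bash
# days_until_threshold THRESHOLD (hypothetical helper): read "day_index value"
# pairs on stdin (oldest first), fit a least-squares line, and print how many
# days after the last sample the fitted line reaches THRESHOLD.
days_until_threshold() {
  local threshold="$1"
  awk -v thr="$threshold" '
    NF >= 2 {
      n++; x = $1; y = $2
      sx += x; sy += y; sxx += x * x; sxy += x * y
      last = x
    }
    END {
      denom = n * sxx - sx * sx
      if (n < 2 || denom == 0) { print "insufficient data"; exit }
      slope = (n * sxy - sx * sy) / denom
      if (slope <= 0) { print "no upward trend"; exit }
      intercept = (sy - slope * sx) / n
      printf "%.1f\n", (thr - intercept) / slope - last
    }'
}

# Usage: daily peak RAM percentages, projecting when 90% would be reached
# printf '0 50\n1 55\n2 60\n3 65\n' | days_until_threshold 90
```

A negative result means the fitted line has already crossed the threshold. Seven data points give only a coarse estimate, so treat the output as a prompt to look closer, not as a forecast.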
Slide 34 of 35  |  Applied Script
Applied Script: Automated Performance Report
A weekly automated performance summary emailed to ops -- no manual review required for normal operation.
#!/usr/bin/env bash
# perf-weekly-report.sh -- weekly summary for ops team
set -euo pipefail
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

NODE="$(hostname -f)"
REPORT="/tmp/perf-report-$(date +%F).txt"

{
  echo "WEEKLY PERFORMANCE REPORT: $NODE"
  echo "Generated: $(date)"
  echo
  echo "=== CPU (previous day average) ==="
  # Average line fields: $3=%user $5=%system $6=%iowait ($7 is %steal)
  sar -u -f "/var/log/sysstat/sa$(date -d '1 day ago' +%d)" | \
    awk '/Average/{printf "Avg user: %.1f%% Avg system: %.1f%% Avg iowait: %.1f%%\n", $3, $5, $6}'
  echo "=== Memory ==="
  free -h
  echo "=== Disk Utilization (first 5 samples with %util > 5) ==="
  # Limit inside awk: piping into head can end awk with SIGPIPE and trip pipefail
  sar -d | awk 'NR>3 && $NF+0>5 && ++n<=5'
  echo "=== Disk Space ==="
  df -h | grep -v tmpfs | awk 'NR==1||int($5)>70'
  echo "=== Network (average rate today) ==="
  # $5=rxkB/s $6=txkB/s are rates, not totals; change eth0 to match your interface name
  sar -n DEV | awk '/Average.*eth0/{printf "eth0: %.1fMB/s rx %.1fMB/s tx\n", $5*1024/1e6, $6*1024/1e6}'
} > "$REPORT"

mail -s "[Perf Report] $NODE $(date +%Y-%m-%d)" ops@sector.local < "$REPORT"
Slide 35 of 35  |  ALA-08 Summary
Performance Analysis: What You Now Know
You are no longer guessing when a system runs slow. You have a methodology, a toolkit, and the mental model to identify whether the bottleneck is CPU, memory, I/O, or network -- and the commands to quantify it in under 60 seconds.
1. Load average above your core count is worth investigating. nproc tells you the count. Always normalize load to core count before interpreting it.
2. free: the "available" column is what matters, not "free". Linux fills free RAM with disk cache. High "used" + high "buff/cache" is normal and desirable.
3. vmstat: watch the "b" column (D-state processes) and "si/so" (swap in/out). Sustained non-zero si/so means the system is actively swapping: reduce memory demand or add RAM.
4. iostat -xz: %util > 80% and await > 10ms (HDD) indicate disk saturation. Use iotop to find which process is responsible.
5. sar is the only tool in this lecture that covers the past. Enable sysstat and let it collect data continuously. You will need it at 08:00 for a 03:00 incident.
6. /proc and /sys are the source of truth. Every performance tool reads from them. Reading them directly is always an option when tools are unavailable.
7. sysctl vm.swappiness=10 for servers. Reduces swap usage, keeps working sets in RAM. Persist in /etc/sysctl.d/.
8. nice -n 19 for background jobs, so they do not starve interactive or production workloads. Combine with ionice -c3 for I/O isolation.
9. Measure before tuning. A tuning change that improves a metric you do not have a baseline for may be making things worse. Capture baselines first, then compare.
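The first point, normalizing load to core count, comes down to one division. This is a minimal sketch with a hypothetical helper name, load_per_core, which falls back to nproc when no core count is given.

```shell
#!/usr/bin/env bash
# load_per_core LOAD [CORES] (hypothetical helper): print LOAD divided by the
# core count. CORES defaults to the output of nproc when omitted.
load_per_core() {
  local load="$1" cores="${2:-$(nproc)}"
  awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

# Usage: normalize the current 1-minute load (first field of /proc/loadavg)
# load_per_core "$(cut -d' ' -f1 /proc/loadavg)"
```

A load of 8 gives 2.00 on a 4-core machine, meaning twice as much runnable work as there are cores, while the same load on 64 cores is well under 1.00 and leaves most of the machine idle.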