Linux Performance Analysis | Advanced Linux Administration

Slide 1 of 35 | ALA-08 | Week 4 of 4

Linux Performance Analysis

top • htop • iostat • vmstat • sar • free • /proc & /sys • Load Average

A sector node is running slow. Response times are up, users are complaining, and the on-call page just fired. You have sixty seconds to determine whether the bottleneck is CPU, memory, disk I/O, or network before the incident commander asks you for an initial assessment. This lecture gives you the tools and the mental model.

35 Slides | ALA-08 | Week 4 of 4 | Ubuntu 22.04 LTS

Slide 2 of 35

The Performance Investigation Model

Start broad, narrow fast. Do not tune before you have measured.

Step 1: Observe the Whole System

top or htop for a 30-second snapshot. What is CPU doing? Is memory full? What is the load average? What processes are consuming the most? This takes 30 seconds and often identifies the bottleneck category immediately.

Step 2: Isolate the Subsystem

Based on Step 1: high CPU leads to perf and mpstat. High I/O wait leads to iostat and iotop. Memory exhaustion leads to free, vmstat, and /proc/meminfo. Network saturation leads to sar -n DEV and ss.

Step 3: Quantify and Act

Every measurement needs a baseline to be meaningful. A load average of 8 on a 4-core machine is different from a load of 8 on a 64-core machine. Know your system's normal before diagnosing abnormal. sar provides historical baselines.

Brendan Gregg's USE Method

For every resource (CPU, memory, disk, network): check Utilization, Saturation, and Errors. A resource at 90% utilization with a saturated queue and error counters ticking is a clear bottleneck. Resources with low utilization and no saturation are not your problem.

Slide 3 of 35

Load Average: The Most Misread Metric

Load average measures the demand on the CPU scheduler. Understanding it requires knowing your CPU count.

What It Measures

Linux load average = running processes + processes waiting in the run queue + processes in uninterruptible sleep (D state, usually I/O wait). A load of 1.0 on a single-core system means the CPU is exactly saturated. The same load on an 8-core system means it is 12.5% utilized.

Three Numbers

Load average is reported over three periods: 1 minute, 5 minutes, and 15 minutes. 1.23 0.87 0.54 means load increased recently (1-min is highest). 0.54 0.87 1.23 means load is decreasing. Compare the three to understand the trend direction.

When Is Load Too High?

Rule of thumb: sustained load average above the number of CPU cores indicates saturation. nproc gives you the core count. A load of 4.0 on a 4-core machine is worth investigating. A load of 4.0 on a 32-core machine is very light. Always normalize.

# Read load average
cat /proc/loadavg
# 2.34 1.87 1.42 3/287 14823
# 2.34 = 1-min avg  1.87 = 5-min avg  1.42 = 15-min avg
# 3/287 = 3 running / 287 total threads  14823 = last created PID

nproc     # number of logical CPUs available
uptime    # uptime + load averages in one line
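
The "always normalize" rule can be sketched as two small helpers. This is a minimal sketch, not part of the course scripts: the function names load_per_core and load_trend are made up for illustration, and the 4-core host is an assumption. The load values are the ones from the /proc/loadavg example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative, not from any package).

# load_per_core LOAD CORES -> load divided by core count, two decimals
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# load_trend ONE_MIN FIFTEEN_MIN -> rising | falling | steady
load_trend() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (a > b) print "rising"
        else if (a < b) print "falling"
        else print "steady"
    }'
}

# Values from the /proc/loadavg example, on an assumed 4-core host
load_per_core 2.34 4      # per-core load for the 1-minute average
load_trend 2.34 1.42      # compares the 1-minute and 15-minute averages
```

On a live system the inputs would come from the files already shown: load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)". A per-core value that stays above 1.00 matches the rule of thumb for saturation.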

Slide 4 of 35

top: Real-Time Process Monitor

The universal first-response tool. Learn the header fields and interactive controls cold.

# top header breakdown:
top - 14:32:01 up 12 days, 3:14  2 users  load average: 2.34, 1.87, 1.42
Tasks: 287 total   3 running  284 sleeping  0 stopped  0 zombie
%Cpu(s): 23.4 us  5.1 sy  0.0 ni  68.2 id  3.3 wa  0.0 hi  0.0 si  0.0 st
MiB Mem:  32768.0 total   1024.0 free  18432.0 used  13312.0 buff/cache
MiB Swap:  8192.0 total   7680.0 free    512.0 used  14336.0 avail Mem

# CPU line fields:
# us = user space  sy = kernel  ni = nice  id = idle
# wa = I/O wait    hi = hardware interrupts  si = software interrupts
# st = steal (VM hypervisor taking CPU time from this VM)

# High wa% = I/O bottleneck
# High sy% = excessive system calls or kernel work
# High st% = noisy neighbor on hypervisor -- contact cloud provider

# Interactive keys (while top is running):
# 1 — toggle per-CPU breakdown
# M — sort by memory usage (RES)
# P — sort by CPU usage (default)
# k — kill a process by PID
# u — filter by username
# f — manage column fields
# W — write current settings to ~/.toprc

Slide 5 of 35

top Process Fields: Reading the Table

Every column in top has a specific meaning. These are the ones you must know in an incident.

VIRT vs RES vs SHR

VIRT: total virtual memory requested (includes mapped but unused). RES: resident set size, the physical RAM actually used. SHR: shared memory (shared libraries, etc.). For memory pressure, focus on RES. VIRT is usually misleadingly large.

%CPU

Percentage of a single CPU core. On an 8-core system, a process can show 800% if it uses all 8 cores. Values above 100% are normal for multithreaded processes. Divide by core count to get normalized utilization.

S (Process State)

R: running or runnable (in run queue). S: sleeping (waiting for event, normal). D: uninterruptible sleep (I/O wait, cannot be killed). Z: zombie (exited but not reaped). T: stopped (SIGSTOP or trace). Many D-state processes = I/O saturation.

# Useful top command-line options
top -bn1          # batch mode, 1 iteration: non-interactive output (for scripts)
top -p 14823      # monitor specific PID only
top -u nginx      # show only processes for user 'nginx'
top -d 0.5        # update every 0.5 seconds (faster refresh)

# Script-friendly: top 5 CPU consumers
top -bn1 | awk 'NR>7{print}' | head -5
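
The "many D-state processes" check can be turned into a quick tally. A minimal sketch: count_states is a hypothetical helper name, and the sample states fed to it below are made up for the demonstration.

```shell
#!/usr/bin/env bash
# count_states: tally one-letter process states read from stdin.
# Input is one state string per line, as printed by `ps -eo stat=`.
count_states() {
    awk '{ s = substr($1, 1, 1); n[s]++ }
         END { for (s in n) printf "%s %d\n", s, n[s] }' | sort
}

# Demonstration with made-up states (Ss, R+ and so on carry modifier flags
# after the first letter; only the first letter is the state itself)
printf '%s\n' Ss R+ D D 'S<' Z | count_states
```

On a live system: ps -eo stat= | count_states. A D count that stays high across several runs points toward the I/O tools on the following slides.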

Slide 6 of 35

htop: The Ergonomic Alternative

htop provides color-coded meters, tree views, and mouse support while showing the same data as top.

Key Advantages Over top

Per-CPU bars shown graphically. Memory bar distinguishes used/buffers/cache. Process tree view (t key) shows parent-child relationships. F5 sorts by tree structure. F6 sort by any column. F4 filter by string. Mouse click to select and kill.

htop Color Coding

CPU bar: green = user, blue = low-priority, red = kernel, yellow = IRQ. Memory bar: green = used, blue = buffers, yellow = cache. Large blue and yellow segments on the memory bar are healthy: they mean the OS is using otherwise idle RAM for buffers and disk caching, which is normal and desirable.

# Install if not present
apt install htop

# Launch htop
htop

# Key bindings to know:
# F2        — setup: configure columns, meters, color schemes
# F3 or /   — search for process by name
# F4        — filter: show only matching processes
# F5 or t   — toggle tree view (parent-child relationships)
# F6        — sort by selected column
# u         — show processes for a specific user
# k         — send signal to selected process
# l         — show open files for selected process (lsof)
# s         — strace selected process
# i         — set I/O priority (ionice) for selected process
# H         — toggle display of user threads (K toggles kernel threads)

Slide 7 of 35

vmstat: Virtual Memory and CPU Scheduler Statistics

vmstat provides a dense, time-series view of memory, swap, I/O, and CPU in a single compact table.

# vmstat 1 5: 1-second intervals, 5 samples
vmstat 1 5
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
# r  b   swpd   free  buff  cache   si   so    bi    bo   in   cs us sy id wa st
# 2  0      0  98304  2048 524288    0    0     12    48  420  890  8  2 89  1  0
# 1  0      0  97280  2048 524288    0    0      0   144  380  750  6  1 92  1  0

# Column breakdown:
# r  = processes in run queue (number waiting for CPU)
# b  = processes in uninterruptible sleep (I/O wait)
# swpd = virtual memory in swap (should be near 0)
# si/so = swap in/out per second (non-zero = memory pressure)
# bi/bo = blocks in/out per second (disk reads/writes)
# in = hardware interrupts per second
# cs = context switches per second (high cs = scheduler overhead)
# us/sy/id/wa = same as top CPU percentages

# Warning signs:
# r consistently > CPU count = CPU saturation
# b consistently > 0 = I/O bottleneck
# si/so non-zero = swapping (memory exhaustion)
# wa consistently > 20% = disk I/O bottleneck

# Disk statistics mode
vmstat -d 1 3     # per-disk read/write stats
vmstat -s         # event summary (total context switches, interrupts, etc.)
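
The warning signs listed above can be applied mechanically to vmstat output. A minimal sketch, assuming the default 17-column layout shown on this slide; vmstat_flags is a hypothetical name, and the second sample row below is invented to trip every threshold.

```shell
#!/usr/bin/env bash
# vmstat_flags NCPU: read vmstat sample lines on stdin and print one warning
# per threshold crossed. Column positions assume the default layout:
# r b swpd free buff cache si so bi bo in cs us sy id wa st
vmstat_flags() {
    awk -v ncpu="$1" '
        $1 !~ /^[0-9]+$/ { next }                 # skip the two header lines
        $1 > ncpu        { print "CPU saturation: r=" $1 " > " ncpu " CPUs" }
        $2 > 0           { print "Blocked on I/O: b=" $2 }
        $7 > 0 || $8 > 0 { print "Swapping: si=" $7 " so=" $8 }
        $16 > 20         { print "High I/O wait: wa=" $16 "%" }
    '
}

# First row is the healthy sample from the slide; second row is made up
printf '%s\n' \
    ' 2  0      0  98304  2048 524288    0    0    12    48  420  890  8  2 89  1  0' \
    ' 9  3   4096  10240  2048 524288  120  340   900  4000  900 4000 10  5 40 45  0' |
    vmstat_flags 4
```

Live use would be vmstat 1 5 | vmstat_flags "$(nproc)". The first vmstat row reports averages since boot, so a single flagged first row is weaker evidence than flags on the later rows.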

Slide 8 of 35

free: Memory Usage Analysis

The most important column in free is "available" -- not "free". Linux uses all available RAM for caching, which is correct behavior.

free -h
#               total        used        free      shared  buff/cache   available
# Mem:           31Gi        18Gi       1.0Gi       512Mi        12Gi        13Gi
# Swap:          8.0Gi       500Mi       7.5Gi

# Field meanings:
# total     — installed physical RAM
# used      — RAM used by applications (not including cache)
# free      — completely unused RAM (almost always small -- this is fine)
# shared    — tmpfs, shared memory segments
# buff/cache — OS disk cache + buffer cache (this is good: can be reclaimed)
# available — how much RAM a new process can actually use (free + reclaimable cache)

# DO NOT PANIC about "free" being small.
# The OS fills free RAM with disk cache to speed up reads.
# "available" is the real answer to "how much memory can I use?"

# Swap analysis:
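# If Swap: used > 0, some pages have been swapped out at some point. On its own
# this is not active memory pressure; check si/so in vmstat for ongoing swapping.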
# If Swap: used is growing over time, you have a memory leak or undersized RAM.

# Monitor memory every 2 seconds
watch -n2 'free -h'

# Detailed memory breakdown
head -20 /proc/meminfo

Slide 9 of 35

/proc/meminfo: Kernel Memory Accounting

The raw source of all memory statistics. Understanding the key fields lets you diagnose OOM conditions before they happen.

# Key fields from /proc/meminfo
cat /proc/meminfo
# MemTotal:       33554432 kB   — installed RAM
# MemFree:         1048576 kB   — completely unused
# MemAvailable:   13631488 kB   — available for new allocations
# Buffers:          204800 kB   — kernel I/O buffer cache (block devices)
# Cached:         12582912 kB   — page cache (files read from disk)
# SwapCached:          0 kB   — swap content also in RAM (double counted)
# Active:         18350080 kB   — recently used, not easily reclaimed
# Inactive:        9175040 kB   — not recently used, can be reclaimed
# Slab:            2097152 kB   — kernel slab allocator (dentry/inode cache)
# SReclaimable:    1572864 kB   — slab that can be freed under pressure
# SUnreclaim:       524288 kB   — slab that cannot be freed
# VmallocTotal:  very large    — virtual address space for kernel
# HugePages_Total:       0     — 2MB huge pages configured
# DirectMap2M:    32505856 kB   — direct mapped memory using 2MB pages

# Script: check if OOM is imminent
AVAIL=$(awk '/MemAvailable/{print $2}' /proc/meminfo)
TOTAL=$(awk '/MemTotal/{print $2}'     /proc/meminfo)
PCT=$(( 100 - (AVAIL * 100 / TOTAL) ))
(( PCT > 90 )) && echo "CRITICAL: Memory ${PCT}% used -- OOM risk"

Slide 10 of 35

iostat: Disk I/O Statistics

iostat measures disk throughput, I/O operations per second, and wait times. The first tool to reach for when top shows high wa%.

# iostat with extended statistics, 2-second intervals, 5 samples
iostat -xz 2 5

# Device:  r/s   w/s  rkB/s  wkB/s  rrqm/s  wrqm/s  %util  await  r_await w_await svctm aqu-sz
# sda:     10.2  45.3  4096.0 18432.0   0.5    12.3    68.4   8.23    2.10   9.50  1.82   0.37

# Key fields:
# r/s    — read operations per second
# w/s    — write operations per second
# rkB/s  — read throughput (KB/s)
# wkB/s  — write throughput (KB/s)
# await  — average wait time per I/O request (ms) -- KEY METRIC
#          (recent sysstat releases print only r_await / w_await, and no svctm)
# r_await / w_await — separate read and write latencies
# %util  — how busy the device is (100% = saturated)
# aqu-sz — average queue depth (>1 = requests piling up)

# Warning signs:
# %util > 80% consistently = disk saturation
# await > 10ms for HDD, > 1ms for NVMe SSD = slow I/O
# aqu-sz > 1 = queue building up (worse than util alone)

# CPU I/O wait from iostat
iostat -c 1     # CPU stats only: us sy ni id wa steal
iostat -d sda 1  # single device: sda
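
Because the column order of iostat -x differs between sysstat versions, a threshold check is safer when it locates columns by header name. A minimal sketch under that assumption; iostat_hot is a hypothetical name, and the default thresholds (80% util, queue depth 1) are taken from the warning signs above.

```shell
#!/usr/bin/env bash
# iostat_hot [UTIL_MAX] [QUEUE_MAX]: read `iostat -x` output on stdin and print
# devices over either threshold. Columns are found by header name.
iostat_hot() {
    awk -v umax="${1:-80}" -v qmax="${2:-1}" '
        /^Device/ {                        # header row: remember column positions
            split("", col)                 # portable way to clear the array
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $1 ~ /^avg-cpu/ { next }
        ("%util" in col) && NF >= col["%util"] {
            dev = $1; sub(/:$/, "", dev)   # older sysstat prints "sda:"
            util = $(col["%util"]) + 0
            q = ("aqu-sz" in col) ? $(col["aqu-sz"]) + 0 : 0
            if (util > umax || q > qmax)
                printf "%s util=%.1f%% aqu-sz=%.2f\n", dev, util, q
        }
    '
}

# Sample rows from the slide, with the utilization threshold lowered to 60
printf '%s\n' \
    'Device:  r/s   w/s  rkB/s  wkB/s  rrqm/s  wrqm/s  %util  await  r_await w_await svctm aqu-sz' \
    'sda:     10.2  45.3  4096.0 18432.0   0.5    12.3    68.4   8.23    2.10   9.50  1.82   0.37' |
    iostat_hot 60 1
```

Live use would be iostat -xz 2 5 | iostat_hot. As with vmstat, the first report covers the time since boot.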

Slide 11 of 35

iotop: Per-Process I/O Monitor

iostat shows disk totals. iotop shows which specific process is responsible for the I/O load.

# Install iotop
apt install iotop

# Run iotop (requires root or CAP_NET_ADMIN)
iotop               # interactive, sorted by I/O rate
iotop -o            # --only: show only processes with active I/O (cleaner)
iotop -b -n5        # batch mode, 5 iterations (scriptable)

# Output format:
# TID   PRIO  USER  DISK READ  DISK WRITE  SWAPIN  IO  COMMAND
# 8342  be/4  mysql  0.00 B/s  18.42 M/s    0.00%  12.3%  mysqld

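# Script: log processes reading or writing more than 1000 KB/s
# -k forces kilobytes so the threshold compares like with like;
# -qqq drops headers and summary lines, leaving only per-process rows
iotop -b -o -k -qqq -n2 2>&1 | \
    awk '($4+0 > 1000 || $6+0 > 1000) {print $0}' | \
    head -5 | \
    logger -t iotop-alert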

# Alternative: /proc/PID/io (per-process I/O without iotop)
cat /proc/14823/io
# rchar: 1048576      (bytes read via read() calls)
# wchar: 524288       (bytes written via write() calls)
# syscr: 256          (number of read() syscalls)
# syscw: 128          (number of write() syscalls)
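
The counters in /proc/PID/io are cumulative, so a rate needs two readings. A minimal sketch: io_rate is a hypothetical name, and it uses the read_bytes and write_bytes lines of the same file (not shown in the excerpt above), which count bytes that actually reached the storage layer.

```shell
#!/usr/bin/env bash
# io_rate BEFORE AFTER SECONDS: print read/write KiB/s between two saved
# copies of /proc/PID/io, using the read_bytes and write_bytes counters.
io_rate() {
    awk -v secs="$3" '
        FNR == NR { before[$1] = $2; next }     # first file: remember counters
        $1 == "read_bytes:"  { r = ($2 - before[$1]) / 1024 / secs }
        $1 == "write_bytes:" { w = ($2 - before[$1]) / 1024 / secs }
        END { printf "read=%.1f KiB/s write=%.1f KiB/s\n", r, w }
    ' "$1" "$2"
}

# Example (needs permission to read the target's io file):
#   cat /proc/14823/io > /tmp/io.1; sleep 5; cat /proc/14823/io > /tmp/io.2
#   io_rate /tmp/io.1 /tmp/io.2 5
```

rchar and wchar include reads and writes served from the page cache, which is why the sketch prefers the *_bytes counters for disk load.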

Slide 12 of 35

sar: System Activity Reporter

sar collects and reports historical system performance data. The only tool that lets you investigate a performance issue that happened last night.

Historical Analysis

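sar records CPU, memory, I/O, and network statistics every 10 minutes by default (via a cron job or systemd timer that calls sa1). On Ubuntu the data is stored in /var/log/sysstat/ and kept for the number of days set by HISTORY in /etc/sysstat/sysstat (7 by default on Debian and Ubuntu; some distributions ship 28). When an incident happens at 03:00, sar has the data you need at 08:00.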

Real-Time Mode

sar also works like vmstat: sar -u 1 10 shows CPU utilization every second for 10 samples. This makes sar a single tool that covers both real-time investigation and historical review.

Installation

Part of the sysstat package. On Ubuntu, after installing, enable collection in /etc/default/sysstat by setting ENABLED="true". The collection job (a cron entry or systemd timer, depending on the release) then starts recording automatically.

# Install and enable sysstat
apt install sysstat
sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
systemctl enable --now sysstat

# Verify data collection is running
ls -la /var/log/sysstat/

Slide 13 of 35

sar: CPU and Load History

Pull yesterday's CPU utilization history to determine when the performance problem began.

# CPU utilization from today's data file
sar -u                   # today's CPU history (10-min intervals)
sar -u 1 5              # live: 1-second intervals, 5 samples
sar -u -f /var/log/sysstat/sa08   # data file for the 8th of the month

# Per-CPU breakdown
sar -P ALL 1 5          # all CPUs individually
sar -P 0,1,2,3 1 3     # specific CPUs 0-3

# Context switches and interrupts
sar -w 1 5              # context switches per second
sar -I ALL 1 3          # interrupt rates by interrupt number

# Load average history
sar -q                   # run queue and load average

# Find the peak load window in yesterday's data
# LC_ALL=C normally gives 24-hour timestamps, which puts ldavg-1 in column 4
LC_ALL=C sar -q -f /var/log/sysstat/sa08 | sort -k4,4 -rn | head -5
# Sorted by 1-minute load average, highest first

Slide 14 of 35

sar: Memory and Disk History

Correlate the memory utilization timeline with the CPU timeline to pinpoint when the system first came under stress.

# Memory utilization history
sar -r                   # memory: free, available, used, buffers, cached, commit
sar -r ALL               # extended: includes huge pages, slab, etc.

# Swap usage history
sar -S                   # swap: total, used, free

# Disk I/O history
sar -b                   # I/O: tps, rtps, wtps, bread/s, bwrtn/s
sar -d                   # per-device: %util, await, tps
sar -d -f /var/log/sysstat/sa08 | grep sda

# Network history
sar -n DEV               # per-interface: rx/tx packets and bytes
sar -n EDEV              # network errors
sar -n TCP               # TCP segments, connection rates

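# Complete incident review: all subsystems yesterday 02:00 to 04:00
# Quote each flag set so "-n DEV" stays together; $flag is left unquoted
# below on purpose so it splits back into "-n" and "DEV"
for flag in "-u" "-r" "-b" "-n DEV"; do
    echo "=== sar $flag ==="
    sar $flag -f /var/log/sysstat/sa08 -s 02:00:00 -e 04:00:00
done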

Slide 15 of 35

/proc: The Kernel's Live Data Export

/proc is a virtual filesystem. Every file in it is a live view into kernel state. Most performance tools read from it.

System-Wide /proc Files

/proc/loadavg: load averages. /proc/meminfo: detailed memory. /proc/stat: CPU stats since boot. /proc/diskstats: disk I/O counters. /proc/net/dev: network interface stats. /proc/sys/: tunable kernel parameters.

Per-Process /proc/PID/

status: name, state, memory. cmdline: full command. fd/: open file descriptors. maps: memory map. io: I/O counters. net/: network sockets. cgroup: cgroup membership. oom_score: OOM killer priority.

/proc/sys Tunables

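Read and write kernel parameters in real time. Changes take effect immediately but reset on reboot. Make permanent via /etc/sysctl.conf or files in /etc/sysctl.d/. Apply without reboot: sysctl -p reloads /etc/sysctl.conf (or a file you name), and sysctl --system reloads every configuration directory.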

# Examples of reading /proc directly
cat /proc/stat | head -5       # raw CPU tick counters per CPU
cat /proc/diskstats            # raw disk I/O counters (basis for iostat)
cat /proc/net/dev              # raw network packet/byte counters
cat /proc/14823/status         # specific process memory and state
cat /proc/14823/cmdline | tr '\0' ' '  # full command line (null-separated)
ls -la /proc/14823/fd/         # all open file descriptors
cat /proc/14823/oom_score      # OOM killer score (higher = killed first)
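
The CPU percentages that top and vmstat show are deltas between two readings of /proc/stat. A minimal sketch of that calculation: cpu_busy_pct is a hypothetical name, and it treats idle and iowait as "not busy", which is one common convention rather than the only one.

```shell
#!/usr/bin/env bash
# cpu_busy_pct "LINE1" "LINE2": busy percentage between two aggregate "cpu"
# lines from /proc/stat. Fields after the label are:
# user nice system idle iowait irq softirq steal (guest time is already
# included in user, so only the first eight counters are summed).
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0
          for (i = 2; i <= 9 && i <= NF; i++) total += $i
          idle = $5 + $6 }
        NR == 1 { t0 = total; i0 = idle }
        NR == 2 { dt = total - t0
                  if (dt > 0) printf "%.1f\n", 100 * (dt - (idle - i0)) / dt
                  else print "0.0" }
    '
}

# Live use: two readings of the first line of /proc/stat, one second apart
if [ -r /proc/stat ]; then
    a=$(head -n1 /proc/stat); sleep 1; b=$(head -n1 /proc/stat)
    cpu_busy_pct "$a" "$b"
fi
```

The same two-sample pattern applies to /proc/diskstats and /proc/net/dev: the files hold counters since boot, and the tools report the difference per interval.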

Slide 16 of 35

/sys: Hardware and Driver Interface

/sys exposes hardware topology, device configuration, and kernel driver settings as a filesystem.

# CPU frequency and power
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# governors: performance, powersave, ondemand, conservative, schedutil

# Set CPU governor to performance for maximum throughput
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo "performance" > "$cpu"
done

# Block device queue tuning
cat /sys/block/sda/queue/scheduler      # current I/O scheduler
cat /sys/block/sda/queue/nr_requests    # queue depth
cat /sys/block/sda/queue/rotational     # 0 = SSD, 1 = HDD

# Set the scheduler (common starting points: none or mq-deadline for SSD/NVMe,
# bfq or mq-deadline for HDD; benchmark your workload before settling)
echo "mq-deadline" > /sys/block/sda/queue/scheduler

# Network device statistics and tuning
cat /sys/class/net/eth0/statistics/rx_bytes
cat /sys/class/net/eth0/statistics/rx_dropped
cat /sys/class/net/eth0/speed    # link speed in Mbps

# NUMA topology
cat /sys/devices/system/node/possible
cat /sys/devices/system/node/node0/meminfo
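
Several /sys files, the I/O scheduler among them, list every available option and put brackets around the active one. A minimal sketch for pulling out the active value and surveying all block devices; active_choice is a hypothetical name.

```shell
#!/usr/bin/env bash
# active_choice "none [mq-deadline] bfq" -> mq-deadline
# Prints nothing when the string contains no bracketed entry.
active_choice() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Report the rotational flag and active scheduler for each block device
for q in /sys/block/*/queue; do
    [ -r "$q/scheduler" ] || continue
    dev=${q%/queue}; dev=${dev##*/}
    printf '%s rotational=%s scheduler=%s\n' "$dev" \
        "$(cat "$q/rotational")" "$(active_choice "$(cat "$q/scheduler")")"
done
```

Some devices report a single unbracketed value such as "none"; for those the scheduler field in this sketch comes out empty.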

Slide 17 of 35

sysctl: Kernel Parameter Tuning

sysctl provides a clean interface to read and set kernel parameters that live in /proc/sys/.

# Read a parameter
sysctl vm.swappiness          # vm.swappiness = 60
sysctl net.core.somaxconn

# Set a parameter (takes effect immediately)
sysctl -w vm.swappiness=10
sysctl -w net.core.somaxconn=65536

# List all kernel parameters, filtered to the vm.* family
sysctl -a | grep '^vm\.'

# Persist in /etc/sysctl.d/99-sector.conf
# vm.swappiness = 10
# net.core.somaxconn = 65536
# net.ipv4.tcp_max_syn_backlog = 8192
# vm.dirty_ratio = 15
# vm.dirty_background_ratio = 5

# Apply without reboot
sysctl -p /etc/sysctl.d/99-sector.conf

# Key parameters explained:
# vm.swappiness=10    — prefer reclaiming page cache over swapping (0 = swap only to avoid OOM; higher = swap more readily)
# vm.dirty_ratio=15   — % RAM that can hold dirty pages before writes are forced
# net.core.somaxconn  — max listen() backlog (connections pending accept())
# net.ipv4.tcp_fin_timeout — seconds an orphaned socket stays in FIN-WAIT-2 (reduce for busy servers)
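
Since sysctl -w changes are lost on reboot and file changes need a reload, the live values and the persisted file can drift apart. A minimal sketch of a drift check: sysctl_drift is a hypothetical name, and the mapping from key to path (dots become slashes under /proc/sys) is a simplification that breaks for keys containing interface names with dots.

```shell
#!/usr/bin/env bash
# sysctl_drift CONF [ROOT]: compare "key = value" lines in CONF with the live
# values under ROOT (default /proc/sys) and print every mismatch.
sysctl_drift() {
    local conf=$1 root=${2:-/proc/sys} key want have path
    while IFS='=' read -r key want; do
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        case $key in ''|'#'*|';'*) continue ;; esac     # blanks and comments
        want=$(printf '%s' "$want" | xargs)             # trim and collapse whitespace
        path=$root/${key//.//}                          # vm.swappiness -> vm/swappiness
        if [ -r "$path" ]; then have=$(xargs < "$path"); else have="(unreadable)"; fi
        [ "$want" = "$have" ] || printf '%s: file=%s live=%s\n' "$key" "$want" "$have"
    done < "$conf"
}

# Example: sysctl_drift /etc/sysctl.d/99-sector.conf
```

No output means the running kernel matches the file. The optional ROOT argument exists only so the function can be exercised against a scratch directory.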

Slide 18 of 35

perf: CPU Profiling and Event Counting

perf uses hardware performance counters to profile CPU usage at the function level, revealing where exactly time is being spent.

# Install
apt install linux-tools-common linux-tools-generic "linux-tools-$(uname -r)"   # Ubuntu package names; Debian ships perf as linux-perf

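# perf stat: count hardware events for a command
# (cache events are not in the default set, so request them with -e)
perf stat -e cache-references,cache-misses,page-faults,cycles,instructions,branches,branch-misses ls /var/log
# Performance counter stats for 'ls /var/log' (abridged, illustrative):
# 3,284,110  cache-references
#   124,832  cache-misses        (3.80% of all cache refs)
#       822  page-faults
# 12,834,782 instructions        (1.12 insn per cycle)
#  1,432,440 branch-misses       (4.23% of all branches)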

# perf top: live sampling profiler (like top, but shows function names)
perf top                     # system-wide function-level profiling
perf top -p 14823            # profile specific process

# Record a profile and analyze it
perf record -g -p 14823 -- sleep 10   # record with call graphs for 10 seconds
perf report                            # interactive analysis
perf report --stdio                    # text output (for logging)

# Flame graph: perf + FlameGraph for visual profiling
perf record -g -p 14823 -- sleep 30
perf script | /opt/FlameGraph/stackcollapse-perf.pl | \
    /opt/FlameGraph/flamegraph.pl > /tmp/flame.svg

Slide 19 of 35

mpstat: Per-CPU Statistics

mpstat shows per-CPU utilization. Single-core saturation while others sit idle indicates a single-threaded bottleneck.

# All CPUs, 1-second intervals, 5 samples
mpstat -P ALL 1 5

# Sample output:
# CPU  %usr  %sys  %iowait  %irq  %soft  %idle
# all  18.4   3.2      8.1   0.2    0.4   69.7
#   0  72.0   4.1      0.0   0.0    0.0   23.9   <-- busiest core
#   1   2.1   3.0     32.1   0.0    0.0   62.8   <-- I/O wait
#   2   3.2   2.8      0.0   0.0    0.0   94.0
#   3   1.2   2.8      0.0   0.0    0.0   96.0

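# Interpretation:
# CPU 0 at 72% user while the others idle = likely single-threaded hot spot (spread work across threads)
# CPU 1 at 32% iowait = I/O bound process on that CPU
# See what is running on CPU 0: ps -eo pid,psr,pcpu,comm --sort=-pcpu | awk '$2 == 0'
# Check whether a process is pinned: taskset -cp PID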

# Single CPU (CPU 0 only)
mpstat -P 0 1 5

# Check CPU affinity of a process
taskset -pc 14823            # current affinity as a CPU list

# Set CPU affinity to CPUs 2 and 3 only
taskset -pc 2,3 14823

Slide 20 of 35

df and du: Filesystem and Directory Space

df shows filesystem-level usage. du shows directory-level usage. Together they find what is consuming disk space.

# df: filesystem usage
df -h                          # human-readable: GiB, MiB
df -hT                         # include filesystem type (ext4, xfs, tmpfs)
df -i                          # inode usage (can be full even if bytes are free)
df -h /var/log                 # only the filesystem containing /var/log

# Disk full but df shows space: check inode exhaustion
df -i /var/log
# Filesystem     Inodes IUsed  IFree IUse% Mounted on
# /dev/sda1      524288 524288     0  100% /var/log    <-- inodes exhausted

# du: directory usage
du -sh /var/log/*               # size of each item directly under /var/log
du -sh /* 2>/dev/null           # top-level directory sizes
du -sh /var/log/* | sort -rh   # sorted by size, largest first

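# Find top 20 largest files under /var
# (the awk step strips the size and prints the rest, so paths with spaces survive)
find /var -type f -printf '%s %p\n' 2>/dev/null | \
    sort -rn | \
    head -20 | \
    awk '{size = $1; sub(/^[0-9]+ /, ""); printf "%.1f MiB\t%s\n", size/1048576, $0}'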
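
Inode exhaustion is easy to miss because df -h looks healthy. A minimal sketch of an alert on the IUse% column; inode_alert is a hypothetical name and the 90% default is an arbitrary choice.

```shell
#!/usr/bin/env bash
# inode_alert [THRESHOLD]: read `df -Pi` output on stdin and print mounts whose
# inode usage is at or above THRESHOLD percent (default 90).
inode_alert() {
    awk -v max="${1:-90}" '
        NR > 1 && $5 ~ /^[0-9]+%$/ {
            pct = $5 + 0                       # "100%" -> 100
            if (pct >= max) printf "%s inodes %d%% used\n", $6, pct
        }
    '
}

# Live use: -P keeps each filesystem on a single line
df -Pi 2>/dev/null | inode_alert 90
```

Filesystems that do not report inode counts show "-" in that column and are skipped. Mount points containing spaces would be truncated by this sketch.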

Slide 21 of 35

Network Performance: ss, nethogs, and sar -n

Identify network saturation, connection storms, and which processes are consuming bandwidth.

# ss: socket statistics (replacement for netstat)
ss -s                          # summary: total sockets by state
ss -tunapl                     # TCP/UDP, numeric, all, processes, listening
ss -tp state established       # established TCP connections with process
ss -Htn state time-wait | wc -l  # count TIME_WAIT (high = connection churn); -H drops the header line

# nethogs: per-process bandwidth (like iotop for network)
apt install nethogs
nethogs eth0

# sar network history
sar -n DEV 1 5                # per-interface rx/tx rates live
sar -n TCP 1 5                # TCP segments and connection rates
sar -n SOCK                    # socket counts: TCP, UDP, raw

# Check interface for errors and drops
ip -s link show eth0
ethtool -S eth0 | grep -i 'drop\|error\|miss'

# Real-time bandwidth with ifstat (simple, useful)
apt install ifstat
ifstat -i eth0 1
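
A per-state socket count gives a fuller picture of connection churn than a single TIME_WAIT number. A minimal sketch: tcp_states is a hypothetical name, and it only tallies the first column of whatever it is fed.

```shell
#!/usr/bin/env bash
# tcp_states: tally the first column (State) of `ss -Htan` output on stdin,
# most frequent state first.
tcp_states() {
    awk 'NF { n[$1]++ } END { for (s in n) printf "%d %s\n", n[s], s }' | sort -rn
}

# Live use, skipped quietly where ss is not installed
if command -v ss >/dev/null 2>&1; then
    ss -Htan | tcp_states
fi
```

A TIME-WAIT count far above ESTAB usually means many short-lived connections; a large SYN-RECV count points at the listen backlog instead.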

Slide 22 of 35

Memory Leaks: Detecting Growing Processes

A process with a memory leak grows its RES footprint continuously until OOM kills it or you restart it.

#!/usr/bin/env bash
# memory-trend.sh -- track process RES growth over time
set -euo pipefail
PID="${1:?Usage: $0 <PID>}"
INTERVAL=10

echo "Tracking PID $PID every ${INTERVAL}s (Ctrl+C to stop)"
while true; do
    [[ ! -d /proc/"$PID" ]] && { echo "PID $PID no longer exists"; break; }
    RES="$(awk '/VmRSS/{print $2}' /proc/"${PID}"/status)"
    TS="$(date +%T)"
    echo "${TS} PID=${PID} RES=${RES}kB ($(( RES / 1024 ))MiB)"
    sleep "$INTERVAL"
done

# smem: memory usage breakdown with shared memory accounting
apt install smem
smem -p -s rss                  # sort by RSS
smem --pie=name -s rss          # pie chart of RSS labeled by process name (needs matplotlib)

# valgrind: find memory leaks in a program (development/testing)
valgrind --leak-check=full --track-origins=yes ./myapp

# Check OOM killer history (who got killed)
journalctl -k | grep -i 'killed process\|oom'
dmesg | grep -i oom
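
The output of memory-trend.sh can be reduced to a single growth rate. A minimal sketch that assumes the exact line format the script prints (RES=<number>kB); rss_growth is a hypothetical name and the sample lines are made up.

```shell
#!/usr/bin/env bash
# rss_growth INTERVAL_SECONDS: read memory-trend.sh output on stdin and print
# the average RES growth in kB per minute between the first and last sample.
rss_growth() {
    awk -v step="$1" '
        match($0, /RES=[0-9]+kB/) {
            kb = substr($0, RSTART + 4, RLENGTH - 6) + 0   # digits between "RES=" and "kB"
            if (n == 0) first = kb
            last = kb; n++
        }
        END {
            if (n < 2) { print "need at least two samples"; exit 1 }
            printf "%.1f kB/min over %d samples\n", (last - first) * 60 / ((n - 1) * step), n
        }
    '
}

# Made-up samples taken 10 seconds apart
printf '%s\n' \
    '14:00:00 PID=14823 RES=204800kB (200MiB)' \
    '14:00:10 PID=14823 RES=204900kB (200MiB)' \
    '14:00:20 PID=14823 RES=205100kB (200MiB)' | rss_growth 10
```

A steady positive rate over hours is consistent with a leak, though a cache that is still warming up looks the same over short windows.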

Slide 23 of 35

OOM Killer: The Last Line of Defense

When memory is truly exhausted, the Linux OOM killer selects a process to terminate. Understand and control this mechanism.

# OOM score: higher score = more likely to be killed
cat /proc/14823/oom_score         # current OOM score (0-1000)
cat /proc/14823/oom_score_adj      # adjustment (-1000 to +1000)

# Make a critical process immune to OOM kill (score adjustment -1000)
echo -1000 > /proc/14823/oom_score_adj
# WARNING: use sparingly -- if this process leaks memory, the OOM killer
# must kill other processes instead, and the kernel can panic once nothing killable remains

# Make a dispensable process a preferred OOM target (+1000)
echo 1000 > /proc/14823/oom_score_adj

# Via systemd service unit: set OOM adjustment for the service
# [Service]
# OOMScoreAdjust=-500    # protect, but not immune

# See what the OOM killer chose last time
journalctl -k | grep -i 'out of memory\|killed process' | tail -20

# Tune OOM behavior: 0=kill process, 1=panic kernel (for embedded/critical systems)
sysctl -w vm.panic_on_oom=0
# For production servers, 0 is almost always correct
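
Before memory runs out, it helps to know which processes the OOM killer currently favors. A minimal sketch that ranks every process by oom_score; oom_rank is a hypothetical name, and the optional second argument exists only to point it at a scratch directory.

```shell
#!/usr/bin/env bash
# oom_rank [COUNT] [PROC_ROOT]: list the processes the OOM killer would favor,
# highest oom_score first, as "score pid command". PROC_ROOT defaults to /proc.
oom_rank() {
    local count=${1:-5} root=${2:-/proc} f dir
    for f in "$root"/[0-9]*/oom_score; do
        [ -r "$f" ] || continue
        dir=${f%/oom_score}
        printf '%s %s %s\n' "$(cat "$f" 2>/dev/null)" "${dir##*/}" \
            "$(cat "$dir/comm" 2>/dev/null || echo '?')"
    done | sort -k1,1 -rn | head -n "$count"
}

oom_rank 5
```

If a critical service appears near the top of this list, that is the case for an OOMScoreAdjust setting in its unit file.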

Slide 24 of 35

Process Priority: nice, renice, and chrt

Control how the kernel's CPU scheduler allocates CPU time to competing processes.

# nice: start a process with adjusted priority
# Range: -20 (highest priority) to +19 (lowest priority)
# Default: 0. Only root can use negative (higher priority) values.

nice -n 19 backup.sh         # run backup at lowest priority (won't starve production)
nice -n -5 critical.sh     # higher than default (root only)

# renice: change priority of a running process
renice -n 10 -p 14823        # lower priority of PID 14823
renice -n 10 -u backup       # lower all processes for user 'backup'

# ionice: I/O scheduling priority
ionice -c3 -p 14823          # class 3 = idle (only gets I/O when no one else wants it)
ionice -c2 -n0 -p 14823     # best-effort, highest priority within class
# -c1 = realtime (exclusive, dangerous)
# -c2 = best-effort (default, -n 0-7 = within-class priority)
# -c3 = idle (only when disk is completely free)

# chrt: real-time scheduling policies (for latency-sensitive workloads)
chrt -f -p 50 14823          # SCHED_FIFO at priority 50 (root only)
chrt -p 14823                # show current scheduling policy and priority

Slide 25 of 35

cgroups: Guaranteed Resource Allocation

cgroups enforce resource limits on CPU, memory, I/O, and task counts. systemd uses them for every service.

# View the cgroup a process belongs to
cat /proc/14823/cgroup

# View resource limits for a systemd service via cgroups
systemctl show nginx | grep -E 'CPU|Memory|IO|Tasks'

# Set resource limits in a service unit
# /etc/systemd/system/nginx.service.d/limits.conf
[Service]
# CPU: maximum 200% (2 cores worth)
CPUQuota=200%
# Memory: hard kill at 2GB
MemoryMax=2G
# Memory: start throttling at 1.5GB
MemoryHigh=1500M
# Tasks (threads): limit to 512
TasksMax=512
# I/O: limit read to 100MB/s
IOReadBandwidthMax=/dev/sda 100M

systemctl daemon-reload
systemctl restart nginx

# View current cgroup resource usage
systemd-cgtop               # top-like view of cgroup resource usage
cat /sys/fs/cgroup/system.slice/nginx.service/memory.current
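
Reading memory.current next to memory.max shows how close a service is to its hard limit. A minimal sketch for cgroup v2 directories; cg_mem_pct is a hypothetical name.

```shell
#!/usr/bin/env bash
# cg_mem_pct CGROUP_DIR: print memory.current as a percentage of memory.max
# for a cgroup v2 directory; memory.max holds the literal string "max" when
# no limit is set.
cg_mem_pct() {
    local cur max
    cur=$(cat "$1/memory.current") || return 1
    max=$(cat "$1/memory.max")     || return 1
    if [ "$max" = "max" ]; then
        echo "unlimited (current ${cur} bytes)"
    else
        awk -v c="$cur" -v m="$max" 'BEGIN { printf "%.1f%%\n", 100 * c / m }'
    fi
}

# Example: cg_mem_pct /sys/fs/cgroup/system.slice/nginx.service
```

With the MemoryMax=2G setting from the drop-in above, a value approaching 100% means the service is about to be killed by its own cgroup limit rather than by system-wide memory exhaustion.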

Slide 26 of 35

strace Performance: Finding Syscall Overhead

A process spending excessive time in system calls is either doing excessive I/O or poorly batching operations.

# Count and summarize syscalls with timing (-c)
strace -cp 14823             # attach and summarize until Ctrl+C
# % time     seconds  usecs/call     calls    errors syscall
# 68.23      1.234567        1234      1000       50 read
# 21.14      0.382940         382      1002        2 write
#  8.91      0.161234         161      1000        0 epoll_wait

# Top syscall by time = your optimization target

# Trace only specific syscalls that matter for performance
strace -e trace=read,write,fsync -p 14823

# Find files opened repeatedly (inefficient caching)
# timeout ends the trace after 10 s so sort and uniq see end-of-input
timeout 10 strace -e trace=open,openat -p 14823 2>&1 | \
    grep '^openat' | awk -F'"' '{print $2}' | \
    sort | uniq -c | sort -rn | head -10

# Trace a command and capture all timing data
strace -T -o /tmp/strace.log ./my-slow-command
# -T appends the time spent in each call as <seconds>; list the slowest calls:
awk -F'[<>]' 'NF > 2 {print $(NF-1), $0}' /tmp/strace.log | sort -rn | head -20

Slide 27 of 35

lsof: Open Files and File Descriptor Leaks

A process that opens files without closing them will eventually hit its per-process file descriptor limit and fail with EMFILE; exhausting the system-wide limit produces ENFILE instead.

# List all open files for a process
lsof -p 14823

# Count open file descriptors for a process
ls /proc/14823/fd | wc -l

# System-wide fd limit
cat /proc/sys/fs/file-max        # max total open fds across all processes
cat /proc/sys/fs/file-nr         # allocated, allocated-but-unused, max

# Per-process limit (ulimit)
ulimit -n                        # current process fd limit (soft)
ulimit -Hn                       # hard limit

# Find the process with the most open fds
for pid in /proc/[0-9]*/fd; do
    COUNT="$(ls "$pid" 2>/dev/null | wc -l)"
    NAME="$(cat "${pid%/fd}/comm" 2>/dev/null || echo unknown)"
    echo "$COUNT $NAME ${pid%/fd}"
done 2>/dev/null | sort -rn | head -10

# Increase fd limit for a service in systemd unit
# [Service]
# LimitNOFILE=65536
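
A descriptor count only means something next to the limit it is approaching. A minimal sketch that reads both from /proc; fd_usage is a hypothetical name, and the optional second argument exists only to point it at a scratch directory.

```shell
#!/usr/bin/env bash
# fd_usage PID [PROC_ROOT]: open descriptors versus the soft "Max open files"
# limit from /proc/PID/limits (fourth field on that line).
fd_usage() {
    local dir=${2:-/proc}/$1 used soft
    used=$(ls "$dir/fd" 2>/dev/null | wc -l)
    soft=$(awk '/^Max open files/ { print $4 }' "$dir/limits" 2>/dev/null)
    printf 'pid=%s fds=%d soft_limit=%s\n' "$1" "$used" "${soft:-unknown}"
}

fd_usage "$$"     # the current shell
```

Reading another user's fd directory needs root. A count that climbs steadily toward the soft limit between runs is the leak signature described above.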

Slide 28 of 35 | Applied Workflow

Incident Workflow: 60-Second Triage

A repeatable command sequence that produces a complete system health snapshot in under 60 seconds.

#!/usr/bin/env bash
# triage.sh -- 60-second system health snapshot
set -uo pipefail   # no -e: a missing tool or a SIGPIPE from head must not abort the snapshot

echo "=== SYSTEM OVERVIEW ==="
uptime; hostname; date; nproc

echo "=== CPU (top 5 consumers) ==="
ps aux --sort=-%cpu | head -6

echo "=== MEMORY ==="
free -h
ps aux --sort=-%mem | head -6

echo "=== DISK ==="
df -h | grep -v tmpfs
iostat -xz 1 2 2>/dev/null | tail -10

echo "=== NETWORK ==="
ss -s
echo "Established TCP: $(ss -Htn state established | wc -l)"

echo "=== LOAD AVERAGE TREND ==="
# sar has no relative-time option; compute the start time with date
sar -q -s "$(date -d '30 minutes ago' +%H:%M:%S)" 2>/dev/null | head -15 || cat /proc/loadavg

echo "=== RECENT ERRORS ==="
journalctl -p err..alert --since "1 hour ago" -n 10 --no-pager

Slide 29 of 35

NUMA: Non-Uniform Memory Access

On multi-socket systems, memory access latency depends on which NUMA node holds the data. NUMA-aware allocation is a significant performance factor.

# Check if system has NUMA topology
numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 16384 MB
# node 0 free: 4096 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 16384 MB
# node 1 free: 6144 MB
# node distances:
# node   0   1
#   0:  10  21    <-- node 0 to node 1 is 2.1x slower

# Run a process with memory bound to node 0
numactl --membind=0 --cpunodebind=0 ./database-server

# Check NUMA statistics for memory allocation patterns
numastat
numastat -p 14823    # NUMA memory usage for specific process

# NUMA memory pressure: check for local vs remote allocation ratio
numastat | awk '/numa_miss|numa_foreign/{print $0}'
# High numa_miss = process is allocating from wrong NUMA node

Slide 30 of 35

Huge Pages: Reducing TLB Pressure

For processes with large working sets (databases, JVMs), huge pages reduce TLB misses and improve throughput.

# Check current huge page configuration
grep -i huge /proc/meminfo
# HugePages_Total:     128
# HugePages_Free:       64
# HugePages_Rsvd:       32
# HugePages_Surp:        0
# Hugepagesize:       2048 kB    (2MB per page)
# AnonHugePages:    524288 kB    (Transparent Huge Pages in use)

# Transparent Huge Pages (THP) -- automatic, no config needed
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never   <-- "always" means THP is active

# For databases (postgres, mongodb): THP 'always' can cause latency spikes
# Switch to 'madvise' so only explicitly-requesting processes get huge pages
# (run as root; the setting does not survive a reboot)
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# Static huge pages for performance-critical applications
sysctl -w vm.nr_hugepages=256    # allocate 256 x 2MB = 512MB of huge pages

# Verify allocation (the kernel may allocate fewer pages if memory is fragmented)
grep HugePages_Total /proc/meminfo
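
The value for vm.nr_hugepages is simple arithmetic: the size of the region you want backed by huge pages divided by Hugepagesize, rounded up. A small sketch of that calculation follows; the function name hugepages_needed is hypothetical, and the page size is read from /proc/meminfo unless passed explicitly.

```bash
#!/usr/bin/env bash
# hugepages_needed WANT_MIB [PAGE_KB] (hypothetical helper)
# Print how many huge pages cover WANT_MIB mebibytes.
# PAGE_KB defaults to the Hugepagesize value reported by /proc/meminfo.
hugepages_needed() {
    local want_mib="$1"
    local page_kb="${2:-$(awk '/^Hugepagesize:/ {print $2}' /proc/meminfo)}"
    # Round up so the whole region fits: ceil(want_kB / page_kB)
    echo $(( (want_mib * 1024 + page_kb - 1) / page_kb ))
}
```

For example, hugepages_needed 512 2048 prints 256, matching the 256 x 2MB = 512MB figure above. The result is a lower bound; applications usually need some headroom beyond their main shared region.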

Slide 31 of 35

Baseline Collection: Know Normal Before Investigating Abnormal

Without a baseline, you cannot distinguish a performance problem from normal operation. Collect baselines continuously.

#!/usr/bin/env bash
# baseline-snapshot.sh -- collect a short performance snapshot (about 10 seconds of sampling)
set -euo pipefail

TS="$(date +%Y%m%d-%H%M)"
OUT="/var/lib/perf-baseline/${TS}"
mkdir -p "$OUT"

uptime                      > "${OUT}/uptime.txt"
free -m                     > "${OUT}/memory.txt"
vmstat -s                   > "${OUT}/vmstat-s.txt"
iostat -xz 1 5              > "${OUT}/iostat.txt"
sar -u -r -b -q 1 5         > "${OUT}/sar.txt"
ss -s                       > "${OUT}/ss-summary.txt"
ps aux --sort=-%cpu         > "${OUT}/ps-cpu.txt"
df -h                       > "${OUT}/df.txt"

# Retain 7 days of snapshots
find /var/lib/perf-baseline -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +    # -mindepth 1 protects the parent directory
echo "Snapshot saved to $OUT"

# Schedule this every 15 minutes via systemd timer or cron
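
A baseline only helps if you compare against it. The sketch below shows one way to do that for the 1-minute load average, assuming the uptime.txt file written by the snapshot script. Both function names are hypothetical, and the 2x threshold is an arbitrary illustration rather than a recommended value.

```bash
#!/usr/bin/env bash
# load1_from_uptime (hypothetical helper): print the 1-minute load average
# from `uptime` output on stdin. Assumes a locale that uses "." as the decimal point.
load1_from_uptime() {
    sed -n 's/.*load average: *\([0-9.]*\),.*/\1/p'
}

# load_vs_baseline CURRENT BASELINE (hypothetical helper): print the ratio of
# current load to baseline load and flag anything above 2x (assumed threshold).
load_vs_baseline() {
    awk -v cur="$1" -v base="$2" 'BEGIN {
        if (base <= 0) { print "baseline is zero, no ratio"; exit }
        ratio = cur / base
        printf "%.1fx baseline%s\n", ratio, (ratio > 2 ? " (investigate)" : "")
    }'
}
```

A typical call would be load_vs_baseline "$(cut -d' ' -f1 /proc/loadavg)" "$(load1_from_uptime < SNAPSHOT_DIR/uptime.txt)", where SNAPSHOT_DIR is a snapshot from a comparable period, such as the same hour on the previous day.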

Slide 32 of 35

Benchmarking: Measuring Subsystem Capacity

Knowing the maximum throughput of each subsystem lets you predict bottlenecks before they appear in production.

# CPU: rough single-core benchmark (time to hash 1 GiB of zeros; bash's time covers the whole pipeline)
time dd if=/dev/zero bs=1M count=1024 | md5sum

# Disk sequential write throughput
dd if=/dev/zero of=/tmp/benchmark.bin bs=1M count=4096 oflag=direct
# direct = bypass page cache (raw disk speed)

# Disk sequential read throughput
dd if=/tmp/benchmark.bin of=/dev/null bs=1M iflag=direct

# Random I/O with fio (install: apt install fio)
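# --direct=1 bypasses the page cache; without it libaio falls back to buffered, effectively synchronous I/O
fio --name=randread --ioengine=libaio --direct=1 --iodepth=32 \
    --rw=randread --bs=4k --size=1G --numjobs=4 \
    --runtime=30 --time_based --group_reporting --filename=/tmp/fio-test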

# Network throughput between two nodes (install iperf3)
# Server: iperf3 -s
# Client:
iperf3 -c backup-node -t 10 -P 4    # 4 parallel streams, 10 seconds

# Memory bandwidth with mbw (install: apt install mbw)
mbw 512    # test with 512MB array
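
To record benchmark results over time rather than read them off the terminal, the measurement can be wrapped in a function that prints a single number. This is a rough sketch under stated assumptions: bench_write is a hypothetical name, it relies on GNU date and GNU dd, and it uses conv=fsync instead of oflag=direct so that it also runs on filesystems that reject O_DIRECT. The figure therefore includes the final flush and is not a pure raw-device number.

```bash
#!/usr/bin/env bash
# bench_write FILE MIB (hypothetical helper)
# Write MIB mebibytes of zeros to FILE, flush them to disk, remove the file,
# and print the achieved sequential write throughput in MiB/s.
bench_write() {
    local file="$1" mib="$2" start end
    start="$(date +%s%N)"                      # nanoseconds since the epoch (GNU date)
    dd if=/dev/zero of="$file" bs=1M count="$mib" conv=fsync status=none
    end="$(date +%s%N)"
    rm -f -- "$file"
    awk -v n="$mib" -v ns="$((end - start))" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}
```

Example: bench_write /var/tmp/bench.bin 1024 writes 1 GiB and prints one MiB/s figure that can be appended to a log. Storage that compresses or detects zero blocks will report inflated numbers with this input.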

Slide 33 of 35

Capacity Planning: Trending Toward Failure

React to problems as they develop, not after they cause incidents. Trend analysis from sar data provides the lead time.

#!/usr/bin/env bash
# capacity-trend.sh -- extract weekly averages from sysstat data
set -euo pipefail

echo "Weekly averages from sar data"
echo "CPU Average Utilization (last 7 days):"
for day in $(seq 0 6); do
    DATE="$(date -d "$day days ago" +%d)"
    FILE="/var/log/sysstat/sa${DATE}"
    [[ -f "$FILE" ]] || continue
    AVG="$(sar -u -f "$FILE" | awk '/Average/{print 100-$8"%"}')"
    echo "  $(date -d "$day days ago" +%Y-%m-%d): CPU used $AVG"
done

echo "Memory high-water mark (last 7 days):"
for day in $(seq 0 6); do
    DATE="$(date -d "$day days ago" +%d)"
    FILE="/var/log/sysstat/sa${DATE}"
    [[ -f "$FILE" ]] || continue
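    # Locate the %memused column by its header name; fixed positions differ between sysstat versions
    PEAK="$(LC_ALL=C sar -r -f "$FILE" | awk '
        /%memused/ { for (i = 1; i <= NF; i++) if ($i == "%memused") c = i; next }
        c && $1 != "Average:" && $c + 0 > max { max = $c + 0 }
        END { printf "%.1f%%", max }')"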
    echo "  $(date -d "$day days ago" +%Y-%m-%d): peak RAM used $PEAK"
done
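
Daily figures become a forecast once you fit a trend line through them. The sketch below is one simple way to do it, assuming one utilization percentage per line (oldest first, one sample per day) and a linear trend. The function name days_until_full is hypothetical, and real growth is often not linear, so treat the result as a rough lead-time estimate.

```bash
#!/usr/bin/env bash
# days_until_full (hypothetical helper)
# Read one utilization percentage per line on stdin (oldest first, one per day),
# fit a least-squares line, and print the estimated days until the fit reaches 100%.
days_until_full() {
    awk 'NF { n++; x = n; sx += x; sy += $1; sxx += x * x; sxy += x * $1 }
         END {
             if (n < 2) { print "need at least 2 samples"; exit }
             slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
             if (slope <= 0) { print "no upward trend"; exit }
             fitted_last = (sy - slope * sx) / n + slope * n   # fitted value on the last day
             days = (100 - fitted_last) / slope
             printf "%.1f\n", (days > 0 ? days : 0)
         }'
}
```

Fed the four daily values 70, 72, 74, 76 (a steady 2 points per day), it prints 12.0: twelve more days until the fitted line reaches 100%.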

Slide 34 of 35 | Applied Script

Applied Script: Automated Performance Report

A weekly automated performance summary emailed to ops -- no manual review required for normal operation.

#!/usr/bin/env bash
# perf-weekly-report.sh -- weekly summary for ops team
set -euo pipefail; PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

NODE="$(hostname -f)"; REPORT="/tmp/perf-report-$(date +%F).txt"

{
    echo "WEEKLY PERFORMANCE REPORT: $NODE"
    echo "Generated: $(date)"
    echo
    echo "=== CPU (yesterday's averages) ==="
    sar -u -f "/var/log/sysstat/sa$(date -d '1 day ago' +%d)" | \
        awk '/Average/{printf "Avg user: %.1f%%  Avg system: %.1f%%  Avg iowait: %.1f%%\n", $3, $5, $6}'
    echo "=== Memory ==="
    free -h
    echo "=== Disk Utilization (devices averaging >5% util today) ==="
    # Filter in awk only: with pipefail, a trailing head that closes the pipe early can abort the script
    sar -d | awk '/^Average:/ && $NF+0 > 5'
    echo "=== Disk Space ==="
    df -h | grep -v tmpfs | awk 'NR==1||int($5)>70'
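    echo "=== Network (average rate today) ==="
    # sar reports rxkB/s and txkB/s as per-second averages; replace eth0 with your interface (see: ip -br link)
    sar -n DEV | awk '/Average.*eth0/{printf "eth0: %.2f MB/s rx  %.2f MB/s tx\n", $5*1024/1e6, $6*1024/1e6}'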
} > "$REPORT"

mail -s "[Perf Report] $NODE $(date +%Y-%m-%d)" ops@sector.local < "$REPORT"
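
The rxkB/s and txkB/s columns from sar -n DEV are per-second averages, not volumes. If the report should show how much data moved, multiply the average rate by the length of the interval. A minimal sketch follows; the function name kbps_to_mib is hypothetical, and it assumes sysstat's convention that 1 kB is 1024 bytes.

```bash
#!/usr/bin/env bash
# kbps_to_mib RATE_KB_PER_S SECONDS (hypothetical helper)
# Print the volume in MiB transferred at an average rate over an interval.
kbps_to_mib() {
    awk -v rate="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", rate * secs / 1024 }'
}
```

For instance, an average of 1000 kB/s sustained for a full day (86400 seconds) gives kbps_to_mib 1000 86400, which prints 84375.0, roughly 82 GiB.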

Slide 35 of 35 | ALA-08 Summary

Performance Analysis: What You Now Know

You are no longer guessing when a system runs slow. You have a methodology, a toolkit, and the mental model to identify whether the bottleneck is CPU, memory, I/O, or network -- and the commands to quantify it in under 60 seconds.

9 Facts to Carry Out of This Lecture

1. Load average above your core count is worth investigating. nproc tells you the count. Always normalize load to core count before interpreting it.

2. free: the "available" column is what matters, not "free". Linux fills otherwise idle RAM with disk cache, so low "free" with high "buff/cache" is normal and desirable.

3. vmstat: watch the "b" column (D-state processes) and "si/so" (swap in/out). Sustained non-zero si/so means the system is actively swapping because the working set does not fit in RAM.

4. iostat -xz: %util > 80% together with await > 10ms (HDD) indicates disk saturation. Use iotop to find which process is responsible.

5. sar is the only tool in this lecture that covers the past. Enable sysstat and let it collect data continuously. You will need it at 08:00 for a 03:00 incident.

6. /proc and /sys are the source of truth. Nearly every tool in this lecture reads from them. Reading them directly is always an option when tools are unavailable.

7. sysctl vm.swappiness=10 for servers. It reduces the kernel's tendency to swap and keeps working sets in RAM. Persist it in /etc/sysctl.d/.

8. nice -n 19 for background jobs, so they do not starve interactive or production workloads. Combine with ionice -c3 for I/O isolation.

9. Measure before tuning. Without a baseline you cannot tell whether a tuning change helped or made things worse. Capture baselines first, then compare.
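
Fact 1 reduces to one division. The sketch below prints load per core, reading the live values from /proc/loadavg and nproc unless both are given as arguments. The function name load_per_core is hypothetical.

```bash
#!/usr/bin/env bash
# load_per_core [LOAD] [CORES] (hypothetical helper)
# Print the 1-minute load average divided by the core count.
# Defaults: LOAD from /proc/loadavg, CORES from nproc.
load_per_core() {
    local load="${1:-$(cut -d' ' -f1 /proc/loadavg)}"
    local cores="${2:-$(nproc)}"
    awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}
```

load_per_core 8 4 prints 2.00: twice as much demand as the machine has cores. The same load of 8 on a much larger machine yields a value well below 1 and is usually unremarkable.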