Grid Diagnostics | Advanced Linux Administration

Slide 1 of 31 | ALA-04 | Week 1 of 4

Grid Diagnostics
Network Troubleshooting Tools

ss • ip route • traceroute • dig • tcpdump • nmap • DNS Debugging

Sector command reports a degraded connection. Services are running. The network interface is up. Something between here and the remote node is broken. This module is the toolkit you use to find out exactly what, exactly where, and exactly how to prove it.

31 Slides ALA-04 Week 1 of 4 Ubuntu 22.04 LTS

Slide 2 of 31

Diagnostic Methodology

Work from the bottom of the stack up. Never skip a layer.

Layer 1: Physical

Is the interface up? Is there a carrier? ip link show for state, ethtool for speed/duplex. Cable seated, switch port active? You cannot fix a Layer 3 problem if Layer 1 is broken.

Layers 2-3: Local Connectivity

Can you reach the default gateway? Ping it. If yes, the local link is good. If no, check ARP (ip neigh), IP address, subnet mask, and routing table. Most "network is down" incidents are Layer 3 misconfigurations on the local host.

Layers 4-7: Service Connectivity

Can you reach the remote service? Is the port open? Is DNS resolving? Use ss, dig, nc, and curl to isolate between transport and application problems. tcpdump reveals what is actually on the wire.

The Golden Rule

Confirm what you find at each layer before moving up. If you cannot ping the gateway, do not dig into DNS. If DNS is resolving, do not run tcpdump for connection problems -- start with ss and nc instead. A systematic approach saves time. Jumping around wastes it.
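The bottom-up rule can be sketched as a small shell helper that runs one check per layer and stops at the first failure. The function name check_layers and its "label::command" argument format are illustrative assumptions, not part of any standard tool.

```shell
# check_layers: run "label::command" checks in order and stop at the first failure.
# Hypothetical helper for illustration; the "::" separator is an arbitrary choice.
check_layers() {
    for entry in "$@"; do
        label=${entry%%::*}      # text before the first "::"
        cmd=${entry#*::}         # text after the first "::"
        if sh -c "$cmd" >/dev/null 2>&1; then
            echo "PASS $label"
        else
            echo "FAIL $label (fix this layer before moving up)"
            return 1
        fi
    done
}

# Example call (interface name and gateway IP are the ones used on these slides):
#   check_layers \
#     "L1 link::ip link show enp3s0 | grep -q 'state UP'" \
#     "L3 gateway::ping -c 2 -W 1 192.168.10.1" \
#     "L7 DNS::dig +short example.com | grep -q ."
```

Because the loop returns on the first failing check, higher layers are never tested while a lower one is still broken, which is the point of the rule.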

Slide 3 of 31

ss — Socket Statistics

The modern replacement for netstat. Faster, more detailed, actively maintained.

Why ss Replaced netstat

netstat builds its output by parsing /proc/net/tcp line by line, which is slow on a busy server with thousands of connections. ss queries kernel socket state directly over netlink. It is significantly faster and exposes more detail, including socket memory and TCP state internals.

What ss Shows

All sockets: TCP, UDP, Unix domain sockets. For each: state (LISTEN, ESTABLISHED, TIME_WAIT, etc.), local address:port, peer address:port, process that owns the socket, and socket buffer sizes. All the information you need to verify a service is actually listening where you expect.

# Show all listening TCP and UDP sockets (add -p for process info)
ss -tuln

# -t = TCP  -u = UDP  -l = listening only  -n = numeric (no DNS lookup)

# Show all established TCP connections with process names
ss -tp state established

# Show all sockets (listening + connected) with process info
ss -tuanp                            # -a = all states (-l would restrict to listening only)

# Show only sockets on a specific port
ss -tnp sport = :443

# Show all sockets to/from a specific IP
ss -tn dst 203.0.113.50

Slide 4 of 31

ss: Filters and TCP States

Narrow the output to exactly the connections you are investigating.

# Filter by TCP state
ss -tn state time-wait               # sockets in TIME_WAIT
ss -tn state established             # active connections
ss -tn state syn-recv                # SYN received, not yet ACK'd (SYN flood indicator)

# Multiple state filters
ss -tn state time-wait state close-wait

# Count connections by state (great for health checks)
# -a is required: plain "ss -tn" hides LISTEN and TIME-WAIT sockets
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn

# Show Unix domain sockets
ss -xl                               # Unix listening sockets
ss -xp                               # Unix sockets with process info

# Show socket memory (send/recv buffer usage)
ss -tm state established

# Find which process is listening on a specific port
ss -tlnp | grep ':8080'

Slide 5 of 31

ss: Practical Investigation Scenarios

Real questions you answer with ss during incident response.

# Q: Is nginx actually listening on port 443?
ss -tlnp | grep nginx
ss -tlnp | grep ':443'

# Q: How many clients are currently connected to this service?
# -H drops the header line so wc -l is exact; sport = the local (server) port
ss -Htn state established '( sport = :443 )' | wc -l

# Q: Is there a SYN flood happening?
ss -Htn state syn-recv | wc -l      # normal = 0; hundreds = flood

# Q: Are there too many TIME_WAIT sockets?
ss -Htn state time-wait | wc -l     # over 5000 may indicate connection thrashing

# Q: Which process holds the most TCP sockets (a common cause of fd exhaustion)?
ss -tp | awk '{print $NF}' | grep -oP 'pid=\K[0-9]+' | \
    sort | uniq -c | sort -rn | head

# Q: Is the database accepting connections from the API? (run on the API host)
ss -tnp state established dst 10.0.100.5 dport = :5432   # dport = remote (database) port
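The SYN flood and TIME_WAIT questions above can be folded into one pass over the socket table. This is a minimal sketch: the function name ss_state_alarm is hypothetical, the 5000 default comes from the slide, and the default of 100 for SYN-RECV is an assumed reading of "hundreds".

```shell
# ss_state_alarm: read "ss -Htan" output on stdin and flag suspicious state counts.
# Usage: ss -Htan | ss_state_alarm [max_syn_recv] [max_time_wait]
# Thresholds are rules of thumb, not hard limits.
ss_state_alarm() {
    awk -v max_syn="${1:-100}" -v max_tw="${2:-5000}" '
        $1 == "SYN-RECV"  { syn++ }      # half-open connections
        $1 == "TIME-WAIT" { tw++ }       # recently closed connections
        END {
            printf "syn-recv=%d time-wait=%d\n", syn, tw
            if (syn > max_syn) print "WARN possible SYN flood"
            if (tw  > max_tw)  print "WARN TIME-WAIT buildup"
        }'
}
```

Reading stdin keeps the function testable against saved output from an incident as well as against the live table.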

Slide 6 of 31

ip route: Routing Table Diagnostics

Confirm the kernel will send packets where you expect. Routing surprises cause most connectivity failures.

# Show the complete routing table
ip route show

# The single most useful routing diagnostic: show which route would be used for a destination
ip route get 8.8.8.8
# 8.8.8.8 via 192.168.10.1 dev enp3s0 src 192.168.10.50 uid 1000
# This tells you: outgoing interface (enp3s0), next-hop (192.168.10.1), source IP

ip route get 10.0.100.5
# If this says "unreachable" you have a missing route

# Show all routing tables (policy routing)
ip route show table all

# Show routing policy rules
ip rule show

# Show cached routes (the IPv4 route cache was removed in Linux 3.6,
# so on modern kernels this is usually empty)
ip route show cache

# Check if there are multiple default gateways (routing conflict)
ip route show | grep '^default'
# Multiple "default via" lines = potential conflict
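The default-gateway check can be scripted so it gives a verdict instead of raw lines. A minimal sketch, assuming the usual "default via ..." layout of ip route show output; the function name check_default_routes is hypothetical.

```shell
# check_default_routes: read "ip route show" output on stdin and report
# how many default routes exist.
# Usage: ip route show | check_default_routes
check_default_routes() {
    awk '
        $1 == "default" { n++; routes = routes "  " $0 "\n" }
        END {
            if (n == 0)      print "WARN no default route"
            else if (n == 1) print "OK one default route"
            else             print "WARN " n " default routes (lowest metric wins):"
            if (n > 1) printf "%s", routes   # list the competing routes
        }'
}
```

Several default routes are not always a fault (for example wired plus Wi-Fi with different metrics), so treat the warning as a prompt to check the metrics rather than as proof of a conflict.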

Slide 7 of 31

ping: Beyond "Is it alive?"

ping tells you RTT, packet loss, TTL, and path problems. Use its flags.

# Basic ping with limited count
ping -c 4 192.168.10.1

# Flood ping: send as fast as possible (root required)
ping -f -c 1000 192.168.10.1        # stress test; . = sent, backspace = received

# Set specific packet size (test MTU path)
ping -s 1400 -c 10 192.168.10.1    # 1400 byte payload

# Don't fragment (DF bit): MTU path discovery
ping -M do -s 8972 -c 3 192.168.10.1 # fails if any hop has MTU < 9000

# Set TTL: probe specific hops
ping -t 1 8.8.8.8                  # TTL=1: reply from first hop (gateway)
ping -t 2 8.8.8.8                  # TTL=2: reply from second hop

# IPv6 ping (ping6 is the legacy name for the same thing)
ping -6 -c 4 2001:4860:4860::8888

# Interpret output:
# rtt min/avg/max/mdev = 1.2/1.8/2.4/0.4 ms
# mdev (mean deviation) > 10ms with avg < 5ms = jitter problem
# packet loss > 0% = congestion, hardware fault, or firewall drop
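The interpretation rules above can be applied mechanically to ping's summary lines. A minimal sketch, assuming the iputils summary format shown on this slide; the function name ping_verdict is hypothetical and the thresholds are the ones quoted above.

```shell
# ping_verdict: read ping output on stdin and apply the slide's rules:
#   loss > 0%                      -> packet loss warning
#   mdev > 10 ms while avg < 5 ms  -> jitter warning
# Usage: ping -c 20 -q 192.168.10.1 | ping_verdict
ping_verdict() {
    awk '
        /packet loss/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
        }
        /^rtt/ { split($4, r, "/"); avg = r[2]; mdev = r[4] }   # min/avg/max/mdev
        END {
            printf "loss=%s%% avg=%sms mdev=%sms\n", loss, avg, mdev
            if (loss + 0 > 0)                 print "WARN packet loss"
            if (mdev + 0 > 10 && avg + 0 < 5) print "WARN jitter"
        }'
}
```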

Slide 8 of 31

traceroute and tracepath

Map the path from your node to a destination. Identify where packets die or slow down.

traceroute

Sends packets with increasing TTL values (1, 2, 3...). Each router that decrements TTL to 0 sends an ICMP TTL Exceeded reply, revealing its address and RTT. Default uses UDP on high ports. Use -I for ICMP, -T for TCP SYN (bypasses some firewalls).

tracepath

Similar to traceroute but also discovers Path MTU. Does not require root (unlike traceroute with raw socket options). Less configurable but sufficient for most diagnostics. Shows MTU at each hop, which helps diagnose fragmentation problems.

# Standard traceroute (UDP by default)
traceroute 8.8.8.8

# ICMP mode (like ping -- less likely blocked by firewalls on some hops)
traceroute -I 8.8.8.8

# TCP SYN mode on port 443 (evades firewalls that block ICMP/UDP)
traceroute -T -p 443 8.8.8.8

# Numeric only (no DNS lookups -- much faster)
traceroute -n 8.8.8.8

# Set initial TTL and max hops
traceroute -f 5 -m 20 8.8.8.8      # start at TTL 5, max 20 hops

# tracepath: path MTU discovery included
tracepath 8.8.8.8
tracepath -6 2001:4860:4860::8888 # IPv6 path (tracepath6 on older iputils)

Slide 9 of 31

Reading Traceroute Output

Stars, latency jumps, and asymmetric paths explained.

# Sample output with annotations:
traceroute to 8.8.8.8, 30 hops max
192.168.10.1      1.2 ms  0.9 ms  1.1 ms   # gateway, ~1ms = LAN
10.1.1.1          8.4 ms  8.1 ms  8.2 ms   # ISP first hop
* * *                                        # ICMP filtered -- hop exists but won't reply
72.14.194.1       9.1 ms  9.3 ms  9.0 ms   # Google edge, still normal
8.8.8.8          10.2 ms  9.9 ms 10.1 ms   # destination

Stars (* * *)

Three stars means no reply within the timeout. The hop may exist but is configured not to reply to ICMP (common on ISP routers). Does NOT mean packets are dropped for real traffic -- the probe packets are different from actual data. A star followed by a responding hop is normal.

Sudden Latency Jump

A big increase (e.g., 1 ms to 80 ms) between two consecutive hops indicates where the slow link or congested segment is, provided the later hops stay at the higher latency. If later hops drop back down, the spike only shows that one router gives low priority to its ICMP replies, not that the path is slow. Pinpoint the segment, then investigate the cause.

Stars to the End

If all hops after a certain point return stars and you never reach the destination, packets are being dropped at or after that hop. Could be a firewall rule, a black hole route, or a routing loop. Compare to a traceroute from a different source to triangulate.
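Finding the latency jump by eye gets harder on long paths, so here is a minimal sketch that does it from traceroute -n output. It assumes the standard layout (hop number first, RTT values followed by "ms"), uses only the first RTT of each hop, and skips hops that returned only stars. The function name biggest_jump is hypothetical.

```shell
# biggest_jump: read "traceroute -n" output on stdin and report the hop with
# the largest RTT increase over the previous responding hop.
# Usage: traceroute -n 8.8.8.8 | biggest_jump
biggest_jump() {
    awk '
        $1 ~ /^[0-9]+$/ {
            rtt = ""
            for (i = 2; i <= NF; i++)
                if ($i == "ms") { rtt = $(i - 1); break }   # first RTT on the line
            if (rtt == "") next                             # "* * *": no reply
            if (seen && rtt - prev > max) { max = rtt - prev; at = $1 }
            prev = rtt; seen = 1
        }
        END {
            if (at != "") printf "largest jump: +%.1f ms at hop %s\n", max, at
            else          print "no latency jump found"
        }'
}
```

The result is only a starting point: confirm that the hops after the reported one stay at the higher latency before blaming that segment.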

Slide 10 of 31

dig — DNS Interrogation Tool

The authoritative tool for DNS queries. Shows the full server response, not just the answer.

Why dig over nslookup

nslookup hides most of the response and its output is inconsistent between versions. dig is the standard for DNS diagnostics. It shows the complete DNS response: QUESTION, ANSWER, AUTHORITY, and ADDITIONAL sections, plus response flags (QR, AA, TC, RD, RA), TTL values, and the responding server address.

Reading dig Output

The ANSWER section contains the records returned. NOERROR = query processed. NXDOMAIN = name does not exist. SERVFAIL = DNS server error. REFUSED = server refused the query. AA flag = Authoritative Answer from the zone's own server.

# Query A record (IPv4 address)
dig example.com A

# Short output: answer only
dig +short example.com

# Query a specific record type
dig example.com MX            # mail exchanger records
dig example.com AAAA          # IPv6 address records
dig example.com NS            # authoritative name servers
dig example.com TXT           # text records (SPF, DKIM, verification tokens)
dig example.com SOA           # Start of Authority (zone metadata)

# Reverse lookup: IP to hostname
dig -x 8.8.8.8               # PTR record lookup
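The status codes and the AA flag described under "Reading dig Output" can be pulled out of a full dig response with a short filter. A minimal sketch, assuming dig's usual header layout (";; ->>HEADER<<- ... status: X" and ";; flags: ..."); the function name dig_verdict is hypothetical.

```shell
# dig_verdict: read full dig output on stdin, print the response status with
# its meaning and whether the answer was authoritative (AA flag set).
# Usage: dig @10.0.0.53 sector-db.internal A | dig_verdict
dig_verdict() {
    awk '
        /->>HEADER<<-/ {
            for (i = 1; i < NF; i++)
                if ($i == "status:") { status = $(i + 1); sub(/,$/, "", status) }
        }
        /^;; flags:/ { aa = ($0 ~ / aa[ ;]/) ? "yes" : "no" }
        END {
            meaning["NOERROR"]  = "query processed"
            meaning["NXDOMAIN"] = "name does not exist"
            meaning["SERVFAIL"] = "DNS server error"
            meaning["REFUSED"]  = "server refused the query"
            note = (status in meaning) ? meaning[status] : "not covered on this slide"
            printf "status=%s (%s) authoritative=%s\n", status, note, aa
        }'
}
```

NOERROR with an empty ANSWER section is still a valid outcome: the name exists but has no record of the requested type.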

Slide 11 of 31

dig: Advanced Query Options

Query specific servers, trace the resolution chain, and control output format.

# Query a specific DNS server (not your default resolver)
dig @8.8.8.8 example.com               # query Google's DNS directly
dig @10.0.0.53 sector-db.internal A    # query internal DNS for internal record

# +trace: trace the full delegation chain from root to authoritative
dig +trace example.com A
# Shows: root servers -> .com TLD -> authoritative NS -> final A record
# Use this when delegation is broken or records are cached incorrectly

# +norecurse: ask the server NOT to recurse (check if it holds the answer)
dig +norecurse @ns1.example.com example.com A

# Check TTL on a record (how long it will be cached)
dig +noall +answer example.com A    # answer records only, so the TTL column is easy to read
# example.com.  300  IN  A  93.184.216.34
#               ^^^  = TTL in seconds (300 = 5 minutes)

# Request DNSSEC records (RRSIG) with the answer; an "ad" flag in the reply means the resolver validated it
dig +dnssec example.com A

# Show only the question and answer sections
dig +noall +question +answer example.com

Slide 12 of 31

DNS Debugging: Systematic Approach

A service is unreachable by name but reachable by IP. Follow this sequence.

curl https://api.sector.internal fails with "Could not resolve host." curl https://10.0.100.5 works. DNS is the problem. Narrow it down in 6 steps.

# Step 1: Is the resolver reachable?
dig +short @127.0.0.53 google.com          # test the local stub resolver

# Step 2: What resolver is configured?
resolvectl status | grep 'DNS Servers'

# Step 3: Does the internal resolver know the record?
dig @10.0.0.53 api.sector.internal A       # query internal DNS directly
# NXDOMAIN = record doesn't exist
# SERVFAIL  = resolver can't answer (check its config/logs)

# Step 4: Is the domain in the search list?
resolvectl status | grep 'DNS Domain'
# 'api' alone may not resolve -- needs full FQDN: api.sector.internal

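# Step 5: Trace the delegation to find the authoritative source
# (+trace starts at the public root servers, so it only works for publicly
# delegated names; an internal-only zone will return NXDOMAIN from the root)
dig +trace api.sector.internal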

# Step 6: Flush resolver cache and retry
resolvectl flush-caches
dig +short api.sector.internal

Slide 13 of 31

tcpdump — Packet Capture

See exactly what is on the wire. The definitive ground truth for network debugging.

When to Use tcpdump

When higher-level tools do not explain the problem. When you need to prove what traffic is or is not leaving/arriving. When you need to verify firewall rules are passing packets. When a protocol is misbehaving and you need the exact frames to diagnose it.

Performance Impact

tcpdump captures and processes every matching packet. On a 10 Gbps interface with no filter, it will consume significant CPU. Always use a filter. On production systems, use -c N to limit capture count or -G/-C for file rotation. Never run unfiltered on a busy link.

# Capture on an interface with verbose output
tcpdump -i enp3s0 -n -v

# -i = interface  -n = no hostname lookup  -nn = also keep port numbers numeric  -v = verbose

# Capture to a file for later analysis in Wireshark
tcpdump -i enp3s0 -w /tmp/capture.pcap

# Read a capture file
tcpdump -r /tmp/capture.pcap -n

# Limit capture count
tcpdump -i enp3s0 -n -c 100              # capture 100 packets then exit

Slide 14 of 31

tcpdump: Capture Filters (BPF)

Berkeley Packet Filter syntax. Filter at kernel level before the packet reaches userspace.

# Filter by host IP (src or dst)
tcpdump -i enp3s0 -n host 192.168.10.50

# Filter by source or destination only
tcpdump -i enp3s0 -n src 192.168.10.50
tcpdump -i enp3s0 -n dst 192.168.10.1

# Filter by port
tcpdump -i enp3s0 -n port 443
tcpdump -i enp3s0 -n port 53              # all DNS queries and responses

# Filter by protocol
tcpdump -i enp3s0 -n icmp                 # only ICMP (ping, traceroute, unreachable)
tcpdump -i enp3s0 -n udp                  # only UDP
tcpdump -i enp3s0 -n tcp                  # only TCP

# Combine filters with AND/OR/NOT
tcpdump -i enp3s0 -n host 10.0.100.5 and port 5432   # PostgreSQL to specific host
tcpdump -i enp3s0 -n port 80 or port 443
tcpdump -i enp3s0 -n not port 22                     # exclude SSH from capture

# Filter by network range
tcpdump -i enp3s0 -n net 10.0.100.0/24

Slide 15 of 31

tcpdump: Display Flags and Analysis

Read TCP flags and decode protocol fields from the output.

# Show TCP flags with absolute (not relative) sequence numbers (-S)
tcpdump -i enp3s0 -n -S port 443

# TCP flags in output:
# [S]   = SYN   (connection request)
# [S.]  = SYN+ACK (connection accepted)
# [.]   = ACK   (acknowledgement)
# [P.]  = PUSH+ACK (data transfer)
# [F.]  = FIN+ACK (connection close)
# [R]   = RST   (connection reset -- unexpected close)

# Show ASCII content of packets (-A flag)
tcpdump -i enp3s0 -n -A port 80 | head -50

# Show hex + ASCII (-X flag)
tcpdump -i enp3s0 -n -X -c 5 port 53

# Timestamp in human-readable format (-tttt)
tcpdump -i enp3s0 -n -tttt port 443 -c 20

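# Write packets to a rolling file set (100 MB each, ring of 10 files)
# strftime patterns such as %s in the -w name are only expanded together with -G
tcpdump -i enp3s0 -n -w /var/log/cap.pcap -C 100 -W 10 port 443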

Slide 16 of 31

tcpdump: Operational Scenarios

What you actually use tcpdump for during incident response.

# Scenario 1: Is a service receiving any traffic at all?
tcpdump -i enp3s0 -n -c 20 port 8080
# If nothing appears after 10 seconds: traffic is not arriving at this node
# Check routing, firewalls, and load balancer configuration

# Scenario 2: Are database connection attempts being reset?
tcpdump -i enp3s0 -n 'port 5432 and tcp[tcpflags] & tcp-rst != 0'
# RST packets = connection refused by the host, or a firewall rejecting with a reset

# Scenario 3: Are DNS queries leaving the system?
tcpdump -i enp3s0 -n port 53 -c 20
# See both the query and the response to confirm DNS resolution path

# Scenario 4: Monitor a specific connection to diagnose latency
tcpdump -i enp3s0 -n -tttt host 10.0.100.5 and port 5432
# Compare timestamps between SYN and SYN-ACK = network RTT
# Compare a request (PSH) to the first response data = server processing time plus one RTT

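# Scenario 5: Verify ICMP Fragmentation Needed messages (MTU issues)
# Type 3 = Destination Unreachable, Code 4 = Fragmentation Needed; quote the filter so the shell leaves [ ] alone
tcpdump -i enp3s0 -n 'icmp[0] = 3 and icmp[1] = 4'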
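Scenario 4 asks you to compare the SYN and SYN-ACK timestamps by hand; the subtraction can be sketched in awk. This assumes tcpdump -tttt output (date, then HH:MM:SS.micro, then the packet summary), a filter narrow enough that handshakes do not interleave, and a capture that does not cross midnight. The function name handshake_rtt is hypothetical.

```shell
# handshake_rtt: read "tcpdump -n -tttt" output on stdin and print the time
# between each SYN and the SYN-ACK that follows it.
# Usage: tcpdump -i enp3s0 -n -tttt -c 50 host 10.0.100.5 and port 5432 | handshake_rtt
handshake_rtt() {
    awk '
        # convert HH:MM:SS.ffffff to seconds since midnight
        function secs(t,    p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        /Flags \[S\],/   { syn = secs($2); have_syn = 1 }
        /Flags \[S\.\],/ && have_syn {
            printf "handshake RTT: %.3f ms\n", (secs($2) - syn) * 1000
            have_syn = 0
        }'
}
```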

Slide 17 of 31

nmap — Network Mapper

Discover open ports, services, and OS fingerprints. Used for network auditing, not just attacks.

Authorization Required

Never run nmap against systems you do not own or have explicit written permission to scan. Even on your own systems, active scans generate significant traffic and log entries. On production systems, coordinate with the team and use minimal scan types. Unauthorized scanning can violate computer-misuse laws and acceptable-use policies.

# Basic TCP SYN scan of the most common 1000 ports
nmap -sS 192.168.10.1             # requires root (raw socket)

# TCP connect scan (no raw socket required)
nmap -sT 192.168.10.1

# Scan specific ports
nmap -p 22,80,443,5432 192.168.10.50

# Scan a range of ports
nmap -p 1-1024 192.168.10.50

# Scan all 65535 ports (slow)
nmap -p- 192.168.10.50

# Ping scan only (no port scan): which hosts are up?
nmap -sn 192.168.10.0/24

# Detect service versions
nmap -sV -p 22,80,443 192.168.10.50

Slide 18 of 31

nmap: Port States and Useful Options

Understand what each port state means and how to control scan behavior.

open

A service is actively accepting connections on this port. nmap received a SYN-ACK (or UDP response). This is the definitive answer -- the port is accessible from your scan source.

filtered

nmap cannot determine if the port is open or closed. Probe packets were dropped (no response) or rejected with an ICMP unreachable. A firewall or ACL is blocking the probes. This does NOT mean the port is closed.

closed

No service is listening. nmap received a TCP RST (or ICMP port unreachable for UDP). The host is reachable and the port is accessible but actively refusing connections. The firewall allows the probe but nothing is listening.

# OS detection (requires root)
nmap -O 192.168.10.50

# Aggressive scan: OS detection + service detection + script scan + traceroute
nmap -A 192.168.10.50              # use sparingly, very noisy

# Fast scan: only top 100 ports
nmap -F 192.168.10.50

# Output to file (all formats: normal, XML, greppable)
nmap -sV -p 1-1000 -oA /tmp/scan-results 192.168.10.0/24

Slide 19 of 31

nc — Netcat, the Swiss Army Knife

Test TCP/UDP connectivity without a full client. Faster than nmap for single-port checks.

# Test if a TCP port is open and accepting connections
nc -zv 192.168.10.5 5432
# Connection to 192.168.10.5 5432 port [tcp/postgresql] succeeded!
# -z = zero-I/O (just check if open, don't send data)
# -v = verbose (print success/failure)

# Test with timeout
nc -zvw 3 192.168.10.5 443         # 3 second timeout

# Test a range of ports
nc -zv 192.168.10.5 1-1024 2>&1 | grep 'succeeded'

# UDP port test (UDP is connectionless: "succeeded" only means no ICMP port-unreachable came back)
nc -zvu 192.168.10.5 53

# Simple file transfer (no encryption -- lab use only)
# Receiver:
nc -l -p 9999 > /tmp/received-file

# Sender:
nc 192.168.10.50 9999 < /tmp/source-file

# Quick HTTP request test (check if web server responds)
# printf handles \r\n the same way in every shell; echo -e does not
printf 'GET / HTTP/1.0\r\nHost: sector.internal\r\n\r\n' | nc sector.internal 80

Slide 20 of 31

curl for Network and Service Testing

Test HTTP/HTTPS endpoints, measure response times, and diagnose TLS problems.

# Basic GET request with verbose output (request/response headers, TLS handshake)
curl -v https://sector-api.internal/health

# Detailed timing breakdown
curl -o /dev/null -s -w "\
DNS:     %{time_namelookup}s\n\
Connect: %{time_connect}s\n\
TLS:     %{time_appconnect}s\n\
TTFB:    %{time_starttransfer}s\n\
Total:   %{time_total}s\n" https://sector-api.internal/health

# Skip TLS certificate verification (debug only -- never in production)
curl -k https://sector-api.internal/health

# Follow redirects
curl -L http://sector.internal

# Send headers with request
curl -H "Authorization: Bearer $TOKEN" https://sector-api.internal/status

# Test connectivity to a specific IP bypassing DNS
curl --resolve sector-api.internal:443:10.0.100.5 https://sector-api.internal/health
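The timers in the breakdown above are cumulative: each is measured from the start of the request, so the "TLS" line includes DNS and TCP connect time. A minimal sketch of turning them into per-phase durations, assuming an HTTPS request (for plain HTTP, time_appconnect is 0 and the tls figure is meaningless). The function name curl_phases is hypothetical.

```shell
# curl_phases: convert curl's cumulative timers (seconds) into per-phase
# durations in milliseconds.
# Arguments, in order:
#   time_namelookup time_connect time_appconnect time_starttransfer time_total
curl_phases() {
    awk -v dns="$1" -v con="$2" -v tls="$3" -v ttfb="$4" -v total="$5" 'BEGIN {
        printf "dns=%.0fms tcp=%.0fms tls=%.0fms wait=%.0fms transfer=%.0fms\n",
            dns * 1000,              # name resolution
            (con - dns) * 1000,      # TCP handshake
            (tls - con) * 1000,      # TLS handshake
            (ttfb - tls) * 1000,     # request sent until first response byte
            (total - ttfb) * 1000    # body download
    }'
}

# Example call against the endpoint used on this slide:
#   curl_phases $(curl -o /dev/null -s -w \
#     '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' \
#     https://sector-api.internal/health)
```

A large "wait" with small dns, tcp, and tls values points at the application rather than the network.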

Slide 21 of 31

MTU Troubleshooting: The Silent Killer

MTU mismatches produce the most confusing symptoms: small packets work, large ones silently fail.

SSH works. Short curl responses work. But downloading a large file hangs after a few KB, and web pages only partially load. Classic MTU black hole: the path has a device that drops oversized packets AND does not send ICMP Fragmentation Needed back.

Symptom Pattern

Small data transfers work perfectly. Large transfers hang at a consistent byte count. SSH connects but hangs when displaying a directory listing. Web pages begin loading then freeze. Ping works. These are all consistent with a broken PMTUD path.

Path MTU Discovery (PMTUD)

TCP tries to discover the path MTU by sending packets with the DF bit set. When a packet is too large for a hop, that hop sends back ICMP Type 3 Code 4 (Fragmentation Needed). If a firewall blocks this ICMP message, PMTUD fails silently: the sender never learns the smaller MTU and keeps retransmitting the same oversized segment until the connection times out.

# Step 1: Test if large ping packets are dropped
ping -M do -s 1472 -c 3 192.168.10.1     # 1472 + 28 = 1500 (standard MTU)
ping -M do -s 1400 -c 3 192.168.10.1     # try smaller if 1472 fails

# Step 2: Capture ICMP to see if Fragmentation Needed comes back
# Type 3 = Destination Unreachable, Code 4 = Fragmentation Needed
tcpdump -i enp3s0 -n 'icmp[0] = 3 and icmp[1] = 4'

# Step 3: Clamp MSS to work around the black hole (iptables)
iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
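Step 1 can be automated with a binary search over the ping payload size. A minimal sketch: the names probe and largest_payload are hypothetical, the default target is the gateway address used on this slide, and the search assumes the lower bound already succeeds.

```shell
# probe SIZE: succeeds if one DF-set ping with that payload gets a reply.
# TARGET defaults to the gateway from the slide; override it as needed.
probe() {
    ping -M do -c 1 -W 1 -s "$1" "${TARGET:-192.168.10.1}" >/dev/null 2>&1
}

# largest_payload LO HI: binary-search the largest payload in [LO, HI] for
# which probe succeeds. Assumes probe LO succeeds.
largest_payload() {
    lo=$1
    hi=$2
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))        # round up so the loop always progresses
        if probe "$mid"; then
            lo=$mid                          # mid fits: search above it
        else
            hi=$(( mid - 1 ))                # mid too big: search below it
        fi
    done
    echo "$lo"
}

# Path MTU = payload + 28 (20-byte IPv4 header + 8-byte ICMP header):
#   echo "path MTU: $(( $(largest_payload 1000 8972) + 28 ))"
```

A single lost ping can skew the result, so rerun the search if the answer looks implausible.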

Slide 22 of 31

conntrack: Connection Tracking

Inspect the kernel's connection tracking table. Essential for firewall and NAT debugging.

# Install conntrack tools
apt-get install -y conntrack

# Show all tracked connections
conntrack -L

# Show only established TCP connections
conntrack -L -p tcp --state ESTABLISHED

# Show only connections to a specific IP
conntrack -L -d 192.168.10.50

# Count total tracked connections
conntrack -L | wc -l

# Check connection tracking table size limit
sysctl net.netfilter.nf_conntrack_max
sysctl net.netfilter.nf_conntrack_count    # current count

# Flush all connection tracking entries (use with care)
conntrack -F

conntrack Table Full

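When nf_conntrack_count reaches nf_conntrack_max, the kernel drops new connections and logs "nf_conntrack: table full, dropping packet." Increase the max: sysctl -w net.netfilter.nf_conntrack_max=262144 (add the setting under /etc/sysctl.d/ to keep it across reboots). This is a common cause of intermittent connection timeouts under high traffic: the dropped packets never reach the service, so clients hang rather than get an immediate refusal.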
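Comparing the two sysctl values is easier as a percentage. A minimal sketch: the function name conntrack_usage is hypothetical, and the 80% warning threshold is an assumption, not a kernel default.

```shell
# conntrack_usage COUNT MAX [WARN_PCT]: print table usage and warn when it
# reaches WARN_PCT percent (default 80, an arbitrary choice).
conntrack_usage() {
    count=$1
    max=$2
    pct=$(( count * 100 / max ))             # integer percentage, rounded down
    echo "conntrack $count/$max (${pct}%)"
    if [ "$pct" -ge "${3:-80}" ]; then
        echo "WARN table nearly full"
    fi
}

# Example call with live values:
#   conntrack_usage "$(sysctl -n net.netfilter.nf_conntrack_count)" \
#                   "$(sysctl -n net.netfilter.nf_conntrack_max)"
```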

Slide 23 of 31

Network Performance: iperf3 and sar

Measure actual throughput between two nodes. Verify bandwidth before blaming the application.

# iperf3: TCP throughput test
# On the server (receiver):
iperf3 -s                               # listen on port 5201

# On the client (sender):
iperf3 -c 192.168.10.50               # 10 second test
iperf3 -c 192.168.10.50 -t 30         # 30 second test
iperf3 -c 192.168.10.50 -P 4          # 4 parallel streams

# UDP test (test packet loss and jitter)
iperf3 -c 192.168.10.50 -u -b 100M    # send at 100 Mbps UDP

# Reverse test: server sends to client
iperf3 -c 192.168.10.50 -R

# sar: historical and live network interface statistics
sar -n DEV 2 10                          # every 2 seconds, 10 samples
# rxkB/s = receive rate in KB/s
# txkB/s = transmit rate in KB/s
# rxerr/s = receive errors per second (NIC hardware problem indicator); reported by: sar -n EDEV 2 10

Slide 24 of 31

ARP Inspection: Spotting Spoofing

Detect ARP cache poisoning and gateway impersonation on the local segment.

# View the current ARP cache
ip neigh show

# Monitor ARP activity in real time with tcpdump
tcpdump -i enp3s0 -n arp

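# Check if two entries share the same MAC (ARP spoofing indicator)
# Only count entries that actually have a link-layer address (skips FAILED/INCOMPLETE)
ip neigh show | awk '$4 == "lladdr" {print $5}' | sort | uniq -d
# If a MAC appears twice with different IPs, investigate immediately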

# arping: send ARP requests and display all MAC addresses that respond
arping -c 5 192.168.10.1              # how many different MACs respond to the gateway IP?
# More than one = ARP spoofing in progress

# arpwatch: daemon that logs ARP changes (install separately)
apt-get install -y arpwatch
systemctl start arpwatch
journalctl -u arpwatch -f             # monitor for "changed ethernet address" events

ARP Spoofing Impact

If an attacker poisons your ARP cache, traffic intended for the gateway is sent to the attacker's machine instead. The attacker can then forward the traffic (man-in-the-middle) or drop it (denial of service). Static ARP entries for the gateway are the most effective mitigation on critical hosts.
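The duplicate-MAC check is more useful when it also shows which IP addresses share the MAC. A minimal sketch, assuming the usual ip neigh show layout ("IP dev IFACE lladdr MAC STATE"); the function name arp_duplicates is hypothetical.

```shell
# arp_duplicates: read "ip neigh show" output on stdin and list every MAC
# address that is claimed by more than one IP.
# Usage: ip neigh show | arp_duplicates
arp_duplicates() {
    awk '
        $4 == "lladdr" { ips[$5] = ips[$5] " " $1; n[$5]++ }   # group IPs by MAC
        END {
            for (mac in n)
                if (n[mac] > 1) print "DUPLICATE " mac ips[mac]
        }'
}
```

A shared MAC is not always an attack: a router with several addresses on the segment or a proxy-ARP setup produces the same pattern, so confirm against the known MAC of the gateway.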

Slide 25 of 31

Whole-System Network View

One-liner commands that give you the complete picture fast during an incident.

# Interface status at a glance (IPv4 addresses + states; drop -4 to include IPv6)
ip -4 -brief addr

# Routing table summary
ip route show

# All listening services (what is exposed)
ss -tlnp

# All established TCP connections (who is connected)
ss -tnp state established

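# Count connections per remote IP (spot unexpected traffic sources)
# With a single state filter ss omits the State column, so the peer is field 4; sed strips the port (IPv6-safe)
ss -Htn state established | awk '{print $4}' | sed 's/:[^:]*$//' | sort | uniq -c | sort -rn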

# DNS configuration
resolvectl status

# ARP cache
ip neigh show

# Network errors since boot
ip -s link show enp3s0 | grep -A4 'RX\|TX'

# 30-second bandwidth snapshot
sar -n DEV 5 6 | grep enp3s0

Slide 26 of 31

Packet Loss Diagnosis

Confirm packet loss, localize it to a segment, and distinguish hardware from congestion.

# Step 1: Confirm packet loss exists
ping -c 100 -q 8.8.8.8               # -q = quiet, summary only
# 100 packets transmitted, 97 received, 3% packet loss

# Step 2: Is it local (to gateway) or remote?
ping -c 100 -q 192.168.10.1           # gateway only
# If gateway has no loss but 8.8.8.8 does: loss is upstream, not your fault
# If gateway has loss too: local link problem

# Step 3: Check NIC hardware errors
ip -s link show enp3s0
# RX: errors N   dropped N   missed N  -- hardware errors suggest bad NIC/cable

ethtool -S enp3s0 | grep -E '(error|drop|crc|fifo|miss)'

# Step 4: Check kernel ring buffer drops (soft drops)
cat /proc/net/dev | grep enp3s0
# Columns include drop counts for RX and TX

# Step 5: Check for buffer overflows
sysctl net.core.netdev_max_backlog      # input queue depth

Slide 27 of 31 | Scenario Lab

Full Scenario: Service Unreachable Investigation

A complete, ordered investigation from report to root cause.

Alert: sector-api.internal is returning "connection refused" to monitoring. The API team says their code is running. You have 5 minutes.

# 1. Is the service listening at all?
ss -tlnp | grep ':8080'
# If nothing: service crashed or wrong port in config

# 2. If listening, can we connect locally?
nc -zv 127.0.0.1 8080
# If no: firewall on loopback (unusual), or service only listening on a specific IP

# 3. What IP is the service bound to?
ss -tlnp | grep ':8080'
# 127.0.0.1:8080 = only accessible locally (not from outside!)
# 0.0.0.0:8080   = accessible on all interfaces
# 10.0.100.5:8080 = specific interface only

# 4. Firewall blocking?
iptables -L INPUT -n -v --line-numbers
nft list ruleset

# 5. Can the monitoring system actually route to us?
ip route get 10.0.200.1              # 10.0.200.1 = monitoring server

# 6. Packet capture: are the monitoring probes even arriving?
tcpdump -i enp3s0 -n -c 20 src 10.0.200.1
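Step 3 of the investigation, reading the bind address, can be sketched as a filter over ss output. It assumes the column layout of ss -Htln (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port); the function name bind_scope is hypothetical.

```shell
# bind_scope PORT: read "ss -Htln" output on stdin and explain where the
# service on PORT is reachable from.
# Usage: ss -Htln | bind_scope 8080
bind_scope() {
    awk -v port="$1" '
        {
            addr = $4; p = $4
            sub(/.*:/, "", p)              # keep only the port
            sub(/:[^:]*$/, "", addr)       # keep only the address
            if (p != port) next
            found = 1
            if (addr == "127.0.0.1" || addr == "[::1]")
                print $4 " loopback only (unreachable from other hosts)"
            else if (addr == "0.0.0.0" || addr == "*" || addr == "[::]")
                print $4 " all interfaces"
            else
                print $4 " this address only"
        }
        END { if (!found) print "nothing listening on port " port }'
}
```

A "loopback only" result while remote monitoring reports "connection refused" usually settles the case without needing the firewall or capture steps.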

Slide 28 of 31 | Connection Analysis

Inventorying Connections: Who Is Talking to Whom

Two questions, two tool families: what is connected right now, and which hosts have been exchanging traffic.

Live Socket Inventory

ss and the legacy netstat list every active connection and the process that owns it; lsof -i answers "who is using this port / this host" from the open-file side. Use these to confirm what is connected at this instant.

Traffic Analysis

traffic-vis is the classic program named for this job: it digests captured traffic and reports which hosts communicated and how much, as text, HTML, or a PostScript graph. It is legacy — not shipped on modern Ubuntu — so today reach for iftop and nethogs, which give the same per-host / per-process picture live.

# --- Current connections (live socket table) ---
ss -tnp                          # all TCP connections + owning process (modern)
netstat -tnp                     # same view, legacy net-tools
ss -tnp dst 203.0.113.50         # connections to one specific host
lsof -i @203.0.113.50            # open sockets to that host
lsof -i :443                     # who is using port 443

# --- Traffic analysis: which hosts are talking, and how much ---
traffic-vis                       # LEGACY: analyzes connections to specific hosts (not on modern Ubuntu)
iftop -i enp3s0                   # modern: live bandwidth per connection / host
nethogs enp3s0                    # modern: live bandwidth per process

Slide 29 of 31 | Lab Exercises

Practice Exercises

Complete on your Ubuntu 22.04 lab VM before leaving the lab.

1 Use ss -tulnp to find every process listening on a TCP port. For each of the top 5 services you find, identify the binary using the PID and ps aux | grep PID. Document port, process name, and what that service does.

2 Use dig with +trace to trace the full DNS delegation chain for cloudflare.com. Identify each level: root server, TLD server, and authoritative server. Note the TTL values at each level.

3 Run tcpdump on your primary interface capturing only port 53 traffic. In another terminal, run dig google.com. Read the tcpdump output and identify the DNS query packet and response packet. Note the source and destination IPs and ports.

4 Perform an MTU path test to your default gateway: start at 1472 bytes with the DF bit set. If that works, try 8972. If 8972 fails, find the largest size that succeeds using binary search. Identify the effective path MTU.

5 Run nmap -sT -p 1-1024 127.0.0.1. Review every open port. For each open port, use ss -tlnp to identify the process. Look up any port you don't recognize and document whether it should be open.

Slide 30 of 31

What's Next

Week 1 complete. You have the operational baseline. Week 2 builds security on top of it.

Week 2: Firewall and Filtering

nftables (the modern iptables replacement), firewalld, ufw. Every port you found open with nmap today becomes a decision: should this be accessible? From where? Through what filter? Week 2 answers those questions operationally.

Week 3: DNS and DHCP Servers

Run Bind9 or Unbound for internal DNS. Run ISC-DHCP or Kea. The dig queries you ran today will be aimed at servers you built. Understanding the client-side (this module) makes building the server-side intuitive.

Week 4: Monitoring and Alerting

Prometheus node_exporter, Grafana, alert routing. The ss, sar, and ethtool data you collected manually today can be automated and alerted on. Week 4 makes your diagnostics continuous and proactive.

Slide 31 of 31 | ALA-04

ALA-04 Summary: Key Takeaways

You now have a complete diagnostic toolkit for any network problem. The methodology matters as much as the tools: work from Layer 1 up, confirm each layer before ascending, and let the evidence guide you -- not assumptions. Every tool in this module produces ground truth. Trust the output.

8 Facts to Carry Out of This Lecture

1 ss -tulnp is your first command at any incident. It shows what is listening, on what port, and which process owns it. Run it before anything else.

2 ip route get <IP> shows which interface and gateway the kernel would use for that destination. This single command settles most "which way will this packet go?" questions immediately.

3 Stars (* * *) in traceroute do not mean the path is broken. They mean that hop does not reply to ICMP probes. If the next hop responds, routing is fine through that hop.

4 dig @server hostname queries a specific DNS server directly, bypassing your system resolver. Use this to distinguish between a broken record and a broken resolver.

5 tcpdump with no filter on a busy interface can saturate a CPU core and drop packets. Always use a filter (host X, port Y, proto Z). Always use -n to skip DNS resolution.

6 nc -zv host port tests TCP reachability in about a second when the port answers (add -w 3 to bound the wait when it does not). Use it before launching tcpdump or nmap -- faster and less disruptive.

7 MTU black holes: small packets succeed, large packets hang. Test with ping -M do -s 1472. Fix with MSS clamping if ICMP Fragmentation Needed is being blocked.

8 nmap open = port reachable. nmap filtered = firewall blocking probes (may be open). nmap closed = host reachable, nothing listening. "filtered" does not mean "closed."