Grid Diagnostics | Advanced Linux Administration

Slide 1 of 30  |  ALA-04  |  Week 1 of 8
Grid Diagnostics
Network Troubleshooting Tools
ss  •  ip route  •  traceroute  •  dig  •  tcpdump  •  nmap  •  DNS Debugging
Sector command reports a degraded connection. Services are running. The network interface is up. Something between here and the remote node is broken. This module is the toolkit you use to find out exactly what, exactly where, and exactly how to prove it.
30 Slides ALA-04 Week 1 of 8 Ubuntu 22.04 LTS
Slide 2 of 30
Diagnostic Methodology
Work from the bottom of the stack up. Never skip a layer.
L1: Physical (ip link / ethtool)
L2-3: Network (ping / ip route / arp)
L4-7: Service (ss / dig / tcpdump)
Layer 1: Physical
Is the interface up? Is there a carrier? ip link show for state, ethtool for speed/duplex. Cable seated, switch port active? You cannot fix a Layer 3 problem if Layer 1 is broken.
Layers 2-3: Local Connectivity
Can you reach the default gateway? Ping it. If yes, the local link is good. If no, check ARP (ip neigh), IP address, subnet mask, and routing table. Most "network is down" incidents are Layer 3 misconfigurations on the local host.
Layers 4-7: Service Connectivity
Can you reach the remote service? Is the port open? Is DNS resolving? Use ss, dig, nc, and curl to isolate between transport and application problems. tcpdump reveals what is actually on the wire.
The Golden Rule
Confirm what you find at each layer before moving up. If you cannot ping the gateway, do not dig into DNS. If DNS is resolving, do not run tcpdump for connection problems -- start with ss and nc instead. A systematic approach saves time. Jumping around wastes it.
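The bottom-up rule can be sketched as a small decision function. This is an illustration only: the name first_failed_layer and its four yes/no arguments are hypothetical, and you would fill them in by hand from the checks described on this slide.

```shell
# Hypothetical helper: report the lowest layer whose check failed.
# Arguments are "yes" or "no", in stack order:
#   $1 link up, $2 gateway pingable, $3 DNS resolving, $4 service port reachable
first_failed_layer() {
    link=$1 gateway=$2 dns=$3 port=$4
    if [ "$link" != yes ]; then
        echo "L1: check ip link / ethtool"
    elif [ "$gateway" != yes ]; then
        echo "L2-3: check ip neigh / ip route"
    elif [ "$dns" != yes ]; then
        echo "L4-7: check dig / resolvectl"
    elif [ "$port" != yes ]; then
        echo "L4-7: check ss / nc"
    else
        echo "all checks passed: capture with tcpdump"
    fi
}
```

For example, first_failed_layer yes no yes yes points at the local network layer even though DNS appears to work, which is the point of not skipping layers.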
Slide 3 of 30
ss — Socket Statistics
The modern replacement for netstat. Faster, more detailed, actively maintained.
Why ss Replaced netstat
netstat parses the text files under /proc/net (such as /proc/net/tcp) line by line. On a busy server with thousands of connections, it is slow. ss queries kernel socket state directly over netlink. It is significantly faster and exposes more detail, including socket memory and TCP state internals.
What ss Shows
All sockets: TCP, UDP, Unix domain sockets. For each: state (LISTEN, ESTABLISHED, TIME_WAIT, etc.), local address:port, peer address:port, process that owns the socket, and socket buffer sizes. All the information you need to verify a service is actually listening where you expect.
State   Recv-Q  Send-Q  Local Addr:Port   Peer Addr:Port
ESTAB   0       0       10.0.100.5:443    203.0.113.50:52814   users:(("nginx",pid=1842,fd=12))
# Show all listening TCP and UDP sockets
ss -tuln
# -t = TCP   -u = UDP   -l = listening only   -n = numeric (no DNS lookup)
# Show all established TCP connections with process names
ss -tp state established
# Show listening TCP and UDP sockets with process info (-p; run as root to see every process)
ss -tulnp
# Show only sockets on a specific local port
ss -tnp sport = :443
# Show all sockets whose peer is a specific IP
ss -tn dst 203.0.113.50
Slide 4 of 30
ss: Filters and TCP States
Narrow the output to exactly the connections you are investigating.
LISTEN -> (SYN received) -> SYN_RECV -> (ACK) -> ESTABLISHED -> (FIN) -> TIME_WAIT -> CLOSED
# Filter by TCP state
ss -tn state time-wait      # sockets in TIME_WAIT
ss -tn state established    # active connections
ss -tn state syn-recv       # SYN received, not yet ACK'd (SYN flood indicator)
# Multiple state filters
ss -tn state time-wait state close-wait
# Count connections by state (great for health checks)
# -a is needed here: without it, ss omits LISTEN, TIME-WAIT, and SYN-RECV sockets
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
# Show Unix domain sockets
ss -xl      # Unix listening sockets
ss -xp      # Unix sockets with process info
# Show socket memory (send/recv buffer usage)
ss -tm state established
# Find which process is listening on a specific port
ss -tlnp | grep ':8080'
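The count-by-state one-liner can be wrapped in a function that reads ss output on standard input, which makes it easy to reuse in a health check. This is a sketch; count_tcp_states is a hypothetical name, and it assumes the first line of input is the ss header and the first column is the state.

```shell
# Hypothetical helper: summarize `ss -tan` output as "COUNT STATE" lines,
# highest count first. Reads ss output on stdin; skips the header line.
count_tcp_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' |
        sort -k1,1nr -k2,2
}

# Typical use on a live system (needs iproute2):
#   ss -tan | count_tcp_states
```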
Slide 5 of 30
ss: Practical Investigation Scenarios
Real questions you answer with ss during incident response.
# Q: Is nginx actually listening on port 443?
ss -tlnp | grep nginx
ss -tlnp | grep ':443'
# Q: How many clients are currently connected to this service?
# sport = local port; -H drops the header line so wc -l counts only sockets
ss -Htn state established sport = :443 | wc -l
# Q: Is there a SYN flood happening?
ss -Htn state syn-recv | wc -l      # normal = 0; hundreds = flood
# Q: Are there too many TIME_WAIT sockets?
ss -Htn state time-wait | wc -l     # over 5000 may indicate connection thrashing
# Q: Which processes hold the most sockets (file descriptor pressure)?
ss -tp | awk '{print $NF}' | grep -oP 'pid=\K[0-9]+' | \
  sort | uniq -c | sort -rn | head
# Q: Is the database accepting connections from the API?
# Run on the database host; 10.0.100.5 is the API host, 5432 the local PostgreSQL port
ss -tnp state established dst 10.0.100.5 sport = :5432
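The sockets-per-process question can also be answered without GNU grep's -P option. The sketch below uses only POSIX awk; sockets_per_pid is a hypothetical name, and it assumes input in the users:(("name",pid=N,fd=M)) form that ss -p prints.

```shell
# Hypothetical helper: count how many sockets each PID owns.
# Reads `ss -tp` (or `ss -tnp`) output on stdin, prints "COUNT PID" lines.
sockets_per_pid() {
    awk '{
        # A socket can list several owners, so scan every pid=N on the line.
        while (match($0, /pid=[0-9]+/)) {
            pid = substr($0, RSTART + 4, RLENGTH - 4)
            n[pid]++
            $0 = substr($0, RSTART + RLENGTH)
        }
    }
    END { for (p in n) print n[p], p }' | sort -k1,1nr -k2,2n
}

# Typical use (as root, so that all processes are visible):
#   ss -tnp | sockets_per_pid | head
```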
Slide 6 of 30
ip route: Routing Table Diagnostics
Confirm the kernel will send packets where you expect. Routing surprises cause most connectivity failures.
Packet dst: 8.8.8.8 -> Kernel route table lookup (longest prefix match) -> Result: via 192.168.10.1 dev enp3s0 src 192.168.10.50 (gateway + interface + source IP)
# Show the complete routing table
ip route show
# The single most useful routing diagnostic: show which route would be used for a destination
ip route get 8.8.8.8
# 8.8.8.8 via 192.168.10.1 dev enp3s0 src 192.168.10.50 uid 1000
# This tells you: outgoing interface (enp3s0), next-hop (192.168.10.1), source IP
ip route get 10.0.100.5
# If this reports "Network is unreachable", no route matches the destination
# Show all routing tables (policy routing)
ip route show table all
# Show routing policy rules
ip rule show
# Show cached routes (the IPv4 route cache was removed in Linux 3.6, so this is usually empty on modern kernels)
ip route show cache
# Check if there are multiple default gateways (routing conflict)
ip route show | grep '^default'
# Multiple "default via" lines = potential conflict
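When ip route get is used in a script, the three facts it reports (interface, next hop, source IP) can be pulled out by keyword instead of by position, since the via field is absent for directly connected destinations. A minimal sketch; parse_route_get is a hypothetical name.

```shell
# Hypothetical helper: extract dev, via, and src from the first line of
# `ip route get <IP>` output. "via=direct" means no gateway is involved.
parse_route_get() {
    awk 'NR == 1 {
        via = "direct"; dev = "?"; src = "?"
        for (i = 1; i < NF; i++) {
            if ($i == "via") via = $(i + 1)
            if ($i == "dev") dev = $(i + 1)
            if ($i == "src") src = $(i + 1)
        }
        print "dev=" dev, "via=" via, "src=" src
    }'
}

# Typical use:
#   ip route get 8.8.8.8 | parse_route_get
```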
Slide 7 of 30
ping: Beyond "Is it alive?"
ping tells you RTT, packet loss, TTL, and path problems. Use its flags.
Client 10.0.100.5 -> Server 8.8.8.8: ICMP Echo Request (type 8)
Server 8.8.8.8 -> Client 10.0.100.5: ICMP Echo Reply (type 0)
RTT = request + reply time (e.g. 1.8ms)
# Basic ping with limited count
ping -c 4 192.168.10.1
# Flood ping: send as fast as possible (root required)
ping -f -c 1000 192.168.10.1        # stress test; . = sent, backspace = received
# Set specific packet size (test MTU path)
ping -s 1400 -c 10 192.168.10.1     # 1400 byte payload
# Don't fragment (DF bit): MTU path discovery
ping -M do -s 8972 -c 3 192.168.10.1    # fails if any hop has MTU < 9000
# Set TTL: probe specific hops
ping -t 1 8.8.8.8       # TTL=1: reply from first hop (gateway)
ping -t 2 8.8.8.8       # TTL=2: reply from second hop
# IPv6 ping
ping6 -c 4 2001:4860:4860::8888
# Interpret output:
# rtt min/avg/max/mdev = 1.2/1.8/2.4/0.4 ms
# mdev (mean deviation) > 10ms with avg < 5ms = jitter problem
# packet loss > 0% = congestion, hardware fault, or firewall drop
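The jitter rule of thumb from the interpretation notes (mdev above 10 ms while avg is below 5 ms) can be applied mechanically to the rtt summary line. A sketch only; ping_jitter_check is a hypothetical name, and it assumes the iputils summary format shown on this slide.

```shell
# Hypothetical helper: read ping output on stdin and apply the slide's rule:
# mdev > 10 ms with avg < 5 ms is flagged as "jitter", anything else as "ok".
ping_jitter_check() {
    # Splitting on spaces, "=" and "/" puts avg in field 7 and mdev in field 9
    # of a line like: rtt min/avg/max/mdev = 1.2/1.8/2.4/0.4 ms
    awk -F'[ =/]+' '/^rtt/ {
        avg = $7; mdev = $9
        if (mdev > 10 && avg < 5) print "jitter"; else print "ok"
    }'
}

# Typical use:
#   ping -c 20 -q 192.168.10.1 | ping_jitter_check
```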
Slide 8 of 30
traceroute and tracepath
Map the path from your node to a destination. Identify where packets die or slow down.
Client (TTL=4) -> Router 1 (TTL=3, 1.2ms) -> Router 2 (TTL=2, 8.4ms) -> Router 3 (TTL=1, 9.1ms) -> Destination 8.8.8.8 (10.2ms)
Each router sends ICMP Time Exceeded when TTL reaches 0
traceroute
Sends packets with increasing TTL values (1, 2, 3...). Each router that decrements TTL to 0 sends an ICMP Time Exceeded reply, revealing its address and RTT. Default uses UDP on high ports. Use -I for ICMP, -T for TCP SYN (bypasses some firewalls).
tracepath
Similar to traceroute but also discovers Path MTU. Does not require root (unlike traceroute with raw socket options). Less configurable but sufficient for most diagnostics. Shows MTU at each hop, which helps diagnose fragmentation problems.
# Standard traceroute (UDP by default)
traceroute 8.8.8.8
# ICMP mode (like ping -- less likely blocked by firewalls on some hops)
traceroute -I 8.8.8.8
# TCP SYN mode on port 443 (evades firewalls that block ICMP/UDP)
traceroute -T -p 443 8.8.8.8
# Numeric only (no DNS lookups -- much faster)
traceroute -n 8.8.8.8
# Set initial TTL and max hops
traceroute -f 5 -m 20 8.8.8.8       # start at TTL 5, max 20 hops
# tracepath: path MTU discovery included
tracepath 8.8.8.8
tracepath6 2001:4860:4860::8888     # IPv6 path
Slide 9 of 30
Reading Traceroute Output
Stars, latency jumps, and asymmetric paths explained.
# Sample output with annotations:
traceroute to 8.8.8.8, 30 hops max
 1  192.168.10.1   1.2 ms  0.9 ms  1.1 ms     # gateway, ~1ms = LAN
 2  10.1.1.1       8.4 ms  8.1 ms  8.2 ms     # ISP first hop
 3  * * *                                      # ICMP filtered -- hop exists but won't reply
 4  72.14.194.1    9.1 ms  9.3 ms  9.0 ms     # Google edge, still normal
 5  8.8.8.8        10.2 ms  9.9 ms  10.1 ms   # destination
Stars (* * *)
Three stars means no reply within the timeout. The hop may exist but is configured not to reply to ICMP (common on ISP routers). Does NOT mean packets are dropped for real traffic -- the probe packets are different from actual data. A star followed by a responding hop is normal.
Sudden Latency Jump
A big increase (e.g., 1 ms to 80 ms) between two consecutive hops indicates where the slow link or congested segment is. The subsequent hops that show similar latency confirm the bottleneck is at that link, not oscillating. Pinpoint the segment, then investigate the cause.
Stars to the End
If all hops after a certain point return stars and you never reach the destination, packets are being dropped at or after that hop. Could be a firewall rule, a black hole route, or a routing loop. Compare to a traceroute from a different source to triangulate.
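Finding the "sudden latency jump" by eye gets harder on long paths. The sketch below reads traceroute -n output, ignores hops that answered with stars, and reports the hop with the largest increase over the previous responding hop. largest_jump is a hypothetical name, and only the first RTT of each hop is used, which is a simplification.

```shell
# Hypothetical helper: find the hop with the largest RTT increase.
# Reads `traceroute -n` output on stdin.
largest_jump() {
    awk '
    $1 ~ /^[0-9]+$/ {
        rtt = ""
        # The first RTT is the field immediately before the first "ms".
        for (i = 2; i < NF; i++) if ($(i + 1) == "ms") { rtt = $i; break }
        if (rtt == "") next                      # hop answered * * *
        if (have_prev && rtt - prev > best) { best = rtt - prev; hop = $1 }
        prev = rtt; have_prev = 1
    }
    END { if (hop != "") printf "hop %s +%.1f ms\n", hop, best }'
}

# Typical use:
#   traceroute -n 8.8.8.8 | largest_jump
```

A large jump at one hop matters only if the later hops stay at the higher latency, as described under "Sudden Latency Jump".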
Slide 10 of 30
dig — DNS Interrogation Tool
The authoritative tool for DNS queries. Shows the full server response, not just the answer.
Client (dig A) -> Resolver 8.8.8.8 -> Root (.) a.root-servers -> TLD (.com) a.gtld-servers -> Authoritative ns1.example.com -> Answer: 93.184.216.34 (TTL 300)
Why dig over nslookup
nslookup shows only a summary of the response and its output is inconsistent. dig is the standard for DNS diagnostics. It shows the complete DNS response: QUESTION, ANSWER, AUTHORITY, and ADDITIONAL sections, plus response flags (QR, AA, TC, RD, RA), TTL values, and the responding server address.
Reading dig Output
The ANSWER section contains the records returned. NOERROR = query processed. NXDOMAIN = name does not exist. SERVFAIL = DNS server error. REFUSED = server refused the query. AA flag = Authoritative Answer from the zone's own server.
# Query A record (IPv4 address)
dig example.com A
# Short output: answer only
dig +short example.com
# Query a specific record type
dig example.com MX      # mail exchanger records
dig example.com AAAA    # IPv6 address records
dig example.com NS      # authoritative name servers
dig example.com TXT     # text records (SPF, DKIM, verification tokens)
dig example.com SOA     # Start of Authority (zone metadata)
# Reverse lookup: IP to hostname
dig -x 8.8.8.8          # PTR record lookup
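The status codes listed under "Reading dig Output" appear in the ->>HEADER<<- line of a full dig response. A small sketch that extracts the status and prints the meaning given on this slide; dig_status is a hypothetical name.

```shell
# Hypothetical helper: read full dig output on stdin and explain the status
# found in the ";; ->>HEADER<<- ... status: XXX, ..." line.
dig_status() {
    status=$(sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
    case "$status" in
        NOERROR)  echo "NOERROR: query processed" ;;
        NXDOMAIN) echo "NXDOMAIN: name does not exist" ;;
        SERVFAIL) echo "SERVFAIL: DNS server error" ;;
        REFUSED)  echo "REFUSED: server refused the query" ;;
        *)        echo "unknown status: ${status:-none}" ;;
    esac
}

# Typical use:
#   dig example.com A | dig_status
```

Note that NOERROR with an empty ANSWER section means the name exists but has no record of the requested type, so check the ANSWER section as well as the status.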
Slide 11 of 30
dig: Advanced Query Options
Query specific servers, trace the resolution chain, and control output format.
# Query a specific DNS server (not your default resolver)
dig @8.8.8.8 example.com                # query Google's DNS directly
dig @10.0.0.53 sector-db.internal A     # query internal DNS for internal record
# +trace: trace the full delegation chain from root to authoritative
dig +trace example.com A
# Shows: root servers -> .com TLD -> authoritative NS -> final A record
# Use this when delegation is broken or records are cached incorrectly
# +norecurse: ask the server NOT to recurse (check if it holds the answer)
dig +norecurse @ns1.example.com example.com A
# Check TTL on a record (how long it will be cached)
# +noall +answer prints only answer records, so the question line is not matched by accident
dig +noall +answer example.com A
# example.com. 300 IN A 93.184.216.34
#              ^^^ = TTL in seconds (300 = 5 minutes)
# DNSSEC validation output
dig +dnssec example.com A
# Show only the answer section
dig +noall +answer example.com
Slide 12 of 30
DNS Debugging: Systematic Approach
A service is unreachable by name but reachable by IP. Follow this sequence.
curl https://api.sector.internal fails with "Could not resolve host." curl https://10.0.100.5 works. DNS is the problem. Narrow it down in 6 steps.
Resolver reachable? (dig @127.0.0.53) -> Record exists? (dig @10.0.0.53) -> Search domain set? (resolvectl) -> Trace delegation (dig +trace) -> Flush cache (resolvectl flush-caches)
# Step 1: Is the resolver reachable?
dig +short @127.0.0.53 google.com       # test the local stub resolver
# Step 2: What resolver is configured?
resolvectl status | grep 'DNS Servers'
# Step 3: Does the internal resolver know the record?
dig @10.0.0.53 api.sector.internal A    # query internal DNS directly
# NXDOMAIN = record doesn't exist
# SERVFAIL = resolver can't answer (check its config/logs)
# Step 4: Is the domain in the search list?
resolvectl status | grep 'DNS Domain'
# 'api' alone may not resolve -- needs the FQDN: api.sector.internal
# Step 5: Trace the delegation to find authoritative source
# +trace starts at the public root servers, so it only helps for publicly delegated names
dig +trace api.sector.internal
# Step 6: Flush resolver cache and retry
resolvectl flush-caches
dig +short api.sector.internal
Slide 13 of 30
tcpdump — Packet Capture
See exactly what is on the wire. The definitive ground truth for network debugging.
tcpdump sees:
L2 Ethernet: src/dst MAC
L3 IP header: src/dst IP, TTL
L4 TCP/UDP: ports, flags, seq
Payload (data): HTTP, DNS, TLS...
-A shows ASCII payload | -X shows hex + ASCII | -e shows L2 headers
When to Use tcpdump
When higher-level tools do not explain the problem. When you need to prove what traffic is or is not leaving/arriving. When you need to verify firewall rules are passing packets. When a protocol is misbehaving and you need the exact frames to diagnose it.
Performance Impact
tcpdump captures and processes every matching packet. On a 10 Gbps interface with no filter, it will consume significant CPU. Always use a filter. On production systems, use -c N to limit capture count or -G/-C for file rotation. Never run unfiltered on a busy link.
# Capture on an interface without name resolution
tcpdump -i enp3s0 -n
# -i = interface   -n = numeric addresses (no DNS lookup)   -nn = also keep port numbers numeric   -v = verbose
# Capture to a file for later analysis in Wireshark
tcpdump -i enp3s0 -w /tmp/capture.pcap
# Read a capture file
tcpdump -r /tmp/capture.pcap -n
# Limit capture count
tcpdump -i enp3s0 -n -c 100     # capture 100 packets then exit
Slide 14 of 30
tcpdump: Capture Filters (BPF)
Berkeley Packet Filter syntax. Filter at kernel level before the packet reaches userspace.
# Filter by host IP (src or dst)
tcpdump -i enp3s0 -n host 192.168.10.50
# Filter by source or destination only
tcpdump -i enp3s0 -n src 192.168.10.50
tcpdump -i enp3s0 -n dst 192.168.10.1
# Filter by port
tcpdump -i enp3s0 -n port 443
tcpdump -i enp3s0 -n port 53        # all DNS queries and responses
# Filter by protocol
tcpdump -i enp3s0 -n icmp           # only ICMP (ping, traceroute, unreachable)
tcpdump -i enp3s0 -n udp            # only UDP
tcpdump -i enp3s0 -n tcp            # only TCP
# Combine filters with AND/OR/NOT
tcpdump -i enp3s0 -n host 10.0.100.5 and port 5432     # PostgreSQL to specific host
tcpdump -i enp3s0 -n port 80 or port 443
tcpdump -i enp3s0 -n not port 22    # exclude SSH from capture
# Filter by network range
tcpdump -i enp3s0 -n net 10.0.100.0/24
Slide 15 of 30
tcpdump: Display Flags and Analysis
Read TCP flags and decode protocol fields from the output.
Client :52814 -> Server :443   [S]  SYN seq=100
Server :443 -> Client :52814   [S.] SYN-ACK seq=300 ack=101
Client :52814 -> Server :443   [.]  ACK ack=301
ESTABLISHED
# Show TCP flags with absolute sequence numbers (-S)
tcpdump -i enp3s0 -n -S port 443
# TCP flags in output:
# [S]  = SYN (connection request)
# [S.] = SYN+ACK (connection accepted)
# [.]  = ACK (acknowledgement)
# [P.] = PUSH+ACK (data transfer)
# [F.] = FIN+ACK (connection close)
# [R]  = RST (connection reset -- unexpected close)
# Show ASCII content of packets (-A flag)
tcpdump -i enp3s0 -n -A port 80 | head -50
# Show hex + ASCII (-X flag)
tcpdump -i enp3s0 -n -X -c 5 port 53
# Timestamp in human-readable format (-tttt)
tcpdump -i enp3s0 -n -tttt port 443 -c 20
# Write packets to a rolling file set (about 100 MB each, at most 10 files)
# With -C, tcpdump appends a sequence number to the file name; strftime patterns such as %s apply only with -G
tcpdump -i enp3s0 -n -w /var/log/cap.pcap -C 100 -W 10 port 443
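To see at a glance whether a capture contains resets or only half-open handshakes, the Flags [...] field can be tallied. A sketch, assuming the default tcpdump text output format with "Flags [X]" on each TCP line; tcp_flag_counts is a hypothetical name.

```shell
# Hypothetical helper: tally TCP flag combinations in tcpdump text output.
# Reads tcpdump output on stdin, prints "FLAGS COUNT" lines.
tcp_flag_counts() {
    sed -n 's/.*Flags \[\([^]]*\)\].*/\1/p' |
        awk '{ n[$1]++ } END { for (f in n) print f, n[f] }' |
        LC_ALL=C sort
}

# Typical use (-l keeps tcpdump line-buffered when piping):
#   tcpdump -i enp3s0 -n -l -c 200 port 443 | tcp_flag_counts
```

Many S lines with few S. lines suggests SYNs are going unanswered; any R lines are worth a closer look.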
Slide 16 of 30
tcpdump: Operational Scenarios
What you actually use tcpdump for during incident response.
# Scenario 1: Is a service receiving any traffic at all?
tcpdump -i enp3s0 -n -c 20 port 8080
# If nothing appears after 10 seconds: traffic is not arriving at this node
# Check routing, firewalls, and load balancer configuration
# Scenario 2: Are database connection attempts being reset?
# Match the RST flag in the filter; tcpdump prints it as "Flags [R]" or "[R.]", which grep 'R ' would miss
tcpdump -i enp3s0 -n 'port 5432 and tcp[tcpflags] & tcp-rst != 0'
# RST packets = connection refused by the host or rejected by a firewall
# Scenario 3: Are DNS queries leaving the system?
tcpdump -i enp3s0 -n port 53 -c 20
# See both the query and the response to confirm DNS resolution path
# Scenario 4: Monitor a specific connection to diagnose latency
tcpdump -i enp3s0 -n -tttt host 10.0.100.5 and port 5432
# Compare timestamps between SYN and SYN-ACK = network RTT
# Compare PSH to ACK = server processing time
# Scenario 5: Verify ICMP Fragmentation Needed messages (MTU issues)
# Type 3 = Destination Unreachable, Code 4 = Fragmentation Needed
tcpdump -i enp3s0 -n 'icmp[0] = 3 and icmp[1] = 4'
Slide 17 of 30
nmap — Network Mapper
Discover open ports, services, and OS fingerprints. Used for network auditing, not just attacks.
Authorization Required
Never run nmap against systems you do not own or have explicit written permission to scan. Even on your own systems, active scans generate significant traffic and log entries. On production systems, coordinate with the team and use minimal scan types. Unauthorized scanning can be illegal, depending on the jurisdiction, and commonly violates acceptable-use policies.
# Basic TCP SYN scan of the most common 1000 ports
nmap -sS 192.168.10.1       # requires root (raw socket)
# TCP connect scan (no raw socket required)
nmap -sT 192.168.10.1
# Scan specific ports
nmap -p 22,80,443,5432 192.168.10.50
# Scan a range of ports
nmap -p 1-1024 192.168.10.50
# Scan all 65535 ports (slow)
nmap -p- 192.168.10.50
# Ping scan only (no port scan): which hosts are up?
nmap -sn 192.168.10.0/24
# Detect service versions
nmap -sV -p 22,80,443 192.168.10.50
Slide 18 of 30
nmap: Port States and Useful Options
Understand what each port state means and how to control scan behavior.
open
A service is actively accepting connections on this port. nmap received a SYN-ACK (or UDP response). This is the definitive answer -- the port is accessible from your scan source.
filtered
nmap cannot determine if the port is open or closed. Probe packets were dropped (no response) or rejected with an ICMP unreachable. A firewall or ACL is blocking the probes. This does NOT mean the port is closed.
closed
No service is listening. nmap received a TCP RST (or ICMP port unreachable for UDP). The host is reachable and the port is accessible but actively refusing connections. The firewall allows the probe but nothing is listening.
# OS detection (requires root)
nmap -O 192.168.10.50
# Aggressive scan: OS detection + service detection + script scan + traceroute
nmap -A 192.168.10.50       # use sparingly, very noisy
# Fast scan: only top 100 ports
nmap -F 192.168.10.50
# Output to file (all formats: normal, XML, greppable)
nmap -sV -p 1-1000 -oA /tmp/scan-results 192.168.10.0/24
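The three TCP port states described on this slide reduce to a mapping from "what came back for the probe" to "what nmap reports". The sketch below only encodes that mapping for study purposes; nmap_state_for and its argument keywords are hypothetical and are not nmap options.

```shell
# Hypothetical helper: map a TCP probe result to the nmap port state
# described on this slide. The keywords are made up for this sketch.
nmap_state_for() {
    case "$1" in
        syn-ack)               echo open ;;      # service accepted the connection
        rst)                   echo closed ;;    # host reachable, nothing listening
        none|icmp-unreachable) echo filtered ;;  # probe dropped or rejected en route
        *)                     echo unknown ;;
    esac
}
```

The case worth memorizing is the third: no reply at all says nothing about whether a service is listening behind the filter.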
Slide 19 of 30
nc — Netcat, the Swiss Army Knife
Test TCP/UDP connectivity without a full client. Faster than nmap for single-port checks.
# Test if a TCP port is open and accepting connections
nc -zv 192.168.10.5 5432
# Connection to 192.168.10.5 5432 port [tcp/postgresql] succeeded!
# -z = zero-I/O (just check if open, don't send data)
# -v = verbose (print success/failure)
# Test with timeout
nc -zvw 3 192.168.10.5 443      # 3 second timeout
# Test a range of ports
nc -zv 192.168.10.5 1-1024 2>&1 | grep 'succeeded'
# UDP port test (UDP is connectionless: "succeeded" only means no ICMP port unreachable came back)
nc -zvu 192.168.10.5 53
# Simple file transfer (no encryption -- lab use only)
# Receiver:
nc -l -p 9999 > /tmp/received-file
# Sender:
nc 192.168.10.50 9999 < /tmp/source-file
# Quick HTTP request test (check if web server responds)
echo -e "GET / HTTP/1.0\r\nHost: sector.internal\r\n\r\n" | nc sector.internal 80
Slide 20 of 30
curl for Network and Service Testing
Test HTTP/HTTPS endpoints, measure response times, and diagnose TLS problems.
# Basic GET request with verbose output (request/response headers, TLS handshake)
curl -v https://sector-api.internal/health
# Detailed timing breakdown
curl -o /dev/null -s -w "\
DNS: %{time_namelookup}s\n\
Connect: %{time_connect}s\n\
TLS: %{time_appconnect}s\n\
TTFB: %{time_starttransfer}s\n\
Total: %{time_total}s\n" https://sector-api.internal/health
# Skip TLS certificate verification (debug only -- never in production)
curl -k https://sector-api.internal/health
# Follow redirects
curl -L http://sector.internal
# Send headers with request
curl -H "Authorization: Bearer $TOKEN" https://sector-api.internal/status
# Test connectivity to a specific IP bypassing DNS
curl --resolve sector-api.internal:443:10.0.100.5 https://sector-api.internal/health
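The curl timing variables are cumulative: each one is measured from the start of the request, so the time spent in a single phase is the difference between two consecutive values. A sketch of that arithmetic; curl_phases is a hypothetical name, and it assumes an HTTPS request (for plain HTTP, time_appconnect is 0 and the tls figure is meaningless).

```shell
# Hypothetical helper: turn curl's cumulative timings into per-phase durations.
# Arguments, in seconds: time_namelookup time_connect time_appconnect
#                        time_starttransfer time_total
curl_phases() {
    awk -v dns="$1" -v conn="$2" -v tls="$3" -v ttfb="$4" -v total="$5" 'BEGIN {
        printf "dns=%.3f tcp=%.3f tls=%.3f wait=%.3f transfer=%.3f\n",
            dns, conn - dns, tls - conn, ttfb - tls, total - ttfb
    }'
}

# Typical use:
#   curl_phases $(curl -o /dev/null -s -w \
#     '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' \
#     https://sector-api.internal/health)
```

A large dns value points back to the DNS slides, a large tcp value to the network path, and a large wait value to the application itself.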
Slide 21 of 30
MTU Troubleshooting: The Silent Killer
MTU mismatches produce the most confusing symptoms: small packets work, large ones silently fail.
SSH works. Short curl responses work. But downloading a large file hangs after a few KB, and web pages only partially load. Classic MTU black hole: the path has a device that drops oversized packets AND does not send ICMP Fragmentation Needed back.
Symptom Pattern
Small data transfers work perfectly. Large transfers hang at a consistent byte count. SSH connects but hangs when displaying a directory listing. Web pages begin loading then freeze. Ping works. These are all consistent with a broken PMTUD path.
Path MTU Discovery (PMTUD)
TCP tries to discover the path MTU by sending packets with the DF bit set. When a packet is too large for a hop, that hop sends back ICMP Type 3 Code 4 (Fragmentation Needed). If a firewall blocks this ICMP message, PMTUD fails silently and TCP keeps retransmitting the oversized segment until the connection times out.
# Step 1: Test if large ping packets are dropped
ping -M do -s 1472 -c 3 192.168.10.1    # 1472 + 28 = 1500 (standard MTU)
ping -M do -s 1400 -c 3 192.168.10.1    # try smaller if 1472 fails
# Step 2: Capture ICMP to see if Fragmentation Needed comes back
# Type 3 = Destination Unreachable, Code 4 = Fragmentation Needed
tcpdump -i enp3s0 -n 'icmp[0] = 3 and icmp[1] = 4'
# Step 3: Clamp MSS to work around the black hole (iptables)
iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
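Instead of guessing payload sizes by hand, the largest size that passes with the DF bit set can be found by binary search. The sketch below takes the probe as a function name so the search logic is separate from ping itself; find_max_payload and probe_gateway are hypothetical names, and the gateway address is the one used in this module.

```shell
# Hypothetical helper: binary-search the largest payload size in [lo, hi]
# for which the probe command succeeds. Prints -1 if no size works.
#   $1 = lowest size, $2 = highest size, $3 = probe function taking a size
find_max_payload() {
    lo=$1 hi=$2 probe=$3
    best=-1
    while [ "$lo" -le "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$probe" "$mid"; then
            best=$mid; lo=$(( mid + 1 ))   # mid fits: try larger
        else
            hi=$(( mid - 1 ))              # mid too big: try smaller
        fi
    done
    echo "$best"
}

# Example probe: one DF-bit ping to the gateway with a 1 second timeout.
probe_gateway() {
    ping -M do -s "$1" -c 1 -W 1 192.168.10.1 >/dev/null 2>&1
}

# Typical use:
#   payload=$(find_max_payload 0 8972 probe_gateway)
#   echo "path MTU = $(( payload + 28 ))"    # 20-byte IP + 8-byte ICMP header
```

This assumes loss-free probing; on a lossy link a single failed ping can mislead the search, so repeat the result before trusting it.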
Slide 22 of 30
conntrack: Connection Tracking
Inspect the kernel's connection tracking table. Essential for firewall and NAT debugging.
# Install conntrack tools
apt-get install -y conntrack
# Show all tracked connections
conntrack -L
# Show only established TCP connections
conntrack -L -p tcp --state ESTABLISHED
# Show only connections to a specific IP
conntrack -L -d 192.168.10.50
# Count total tracked connections
conntrack -L | wc -l
# Check connection tracking table size limit
sysctl net.netfilter.nf_conntrack_max
sysctl net.netfilter.nf_conntrack_count     # current count
# Flush all connection tracking entries (use with care)
conntrack -F
conntrack Table Full
When nf_conntrack_count reaches nf_conntrack_max, the kernel drops packets for new connections and logs "nf_conntrack: table full, dropping packet." Increase the max: sysctl -w net.netfilter.nf_conntrack_max=262144 (this does not survive a reboot unless it is also set under /etc/sysctl.d/). Because the packets are dropped silently, this typically shows up as intermittent connection timeouts under high traffic.
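It is better to notice the table filling up before packets are dropped. The sketch below compares the two sysctl values and warns above a threshold; conntrack_usage is a hypothetical name and the 90% threshold is an arbitrary assumption, not a kernel default.

```shell
# Hypothetical helper: report conntrack table usage as a percentage.
#   $1 = current count, $2 = maximum. Warns at 90% or more (arbitrary choice).
conntrack_usage() {
    count=$1 max=$2
    pct=$(( count * 100 / max ))
    if [ "$pct" -ge 90 ]; then
        echo "${pct}% WARN"
    else
        echo "${pct}% ok"
    fi
}

# Typical use:
#   conntrack_usage "$(sysctl -n net.netfilter.nf_conntrack_count)" \
#                   "$(sysctl -n net.netfilter.nf_conntrack_max)"
```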
Slide 23 of 30
Network Performance: iperf3 and sar
Measure actual throughput between two nodes. Verify bandwidth before blaming the application.
# iperf3: TCP throughput test
# On the server (receiver):
iperf3 -s                               # listen on port 5201
# On the client (sender):
iperf3 -c 192.168.10.50                 # 10 second test
iperf3 -c 192.168.10.50 -t 30           # 30 second test
iperf3 -c 192.168.10.50 -P 4            # 4 parallel streams
# UDP test (test packet loss and jitter)
iperf3 -c 192.168.10.50 -u -b 100M      # send at 100 Mbps UDP
# Reverse test: server sends to client
iperf3 -c 192.168.10.50 -R
# sar: historical and live network interface statistics
sar -n DEV 2 10                         # every 2 seconds, 10 samples
# rxkB/s = receive rate in KB/s
# txkB/s = transmit rate in KB/s
sar -n EDEV 2 10                        # error counters are reported by EDEV, not DEV
# rxerr/s = receive errors per second (NIC hardware problem indicator)
Slide 24 of 30
ARP Inspection: Spotting Spoofing
Detect ARP cache poisoning and gateway impersonation on the local segment.
# View the current ARP cache
ip neigh show
# Monitor ARP activity in real time with tcpdump
tcpdump -i enp3s0 -n arp
# Check if two entries share the same MAC (ARP spoofing indicator)
ip neigh show | awk '{print $5}' | sort | uniq -d
# If a MAC appears twice with different IPs, investigate immediately
# arping: send ARP requests and display all MAC addresses that respond
arping -c 5 192.168.10.1        # how many different MACs respond to the gateway IP?
# More than one = ARP spoofing in progress
# arpwatch: daemon that logs ARP changes (install separately)
apt-get install -y arpwatch
systemctl start arpwatch
journalctl -u arpwatch -f       # monitor for "changed ethernet address" events
ARP Spoofing Impact
If an attacker poisons your ARP cache, traffic intended for the gateway is sent to the attacker's machine instead. The attacker can then forward the traffic (man-in-the-middle) or drop it (denial of service). Static ARP entries for the gateway are an effective mitigation on critical hosts.
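The duplicate-MAC check is more useful when it also shows which IP addresses share the MAC. A sketch that looks for the lladdr keyword instead of relying on a fixed column; dup_macs is a hypothetical name. A shared MAC is an indicator only: a router that legitimately owns several addresses produces the same pattern.

```shell
# Hypothetical helper: list MAC addresses that appear for more than one IP.
# Reads `ip neigh show` output on stdin, prints "MAC ip1,ip2,..." lines.
dup_macs() {
    awk '{
        for (i = 1; i < NF; i++) if ($i == "lladdr") {
            mac = $(i + 1)
            ips[mac] = (ips[mac] == "" ? $1 : ips[mac] "," $1)
            n[mac]++
        }
    }
    END { for (m in n) if (n[m] > 1) print m, ips[m] }'
}

# Typical use:
#   ip -4 neigh show | dup_macs
```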
Slide 25 of 30
Whole-System Network View
One-liner commands that give you the complete picture fast during an incident.
# Complete interface status (addresses + states)
ip -4 -brief addr
# Routing table summary
ip route show
# All listening services (what is exposed)
ss -tlnp
# All established TCP connections (who is connected)
ss -tnp state established
# Count connections per remote IP (spot unexpected traffic sources)
ss -tn state established | awk 'NR>1 {print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn
# DNS configuration
resolvectl status
# ARP cache
ip neigh show
# Network errors since boot
ip -s link show enp3s0 | grep -A4 'RX\|TX'
# 30-second bandwidth snapshot
sar -n DEV 5 6 | grep enp3s0
Slide 26 of 30
Packet Loss Diagnosis
Confirm packet loss, localize it to a segment, and distinguish hardware from congestion.
# Step 1: Confirm packet loss exists
ping -c 100 -q 8.8.8.8          # -q = quiet, summary only
# 100 packets transmitted, 97 received, 3% packet loss
# Step 2: Is it local (to gateway) or remote?
ping -c 100 -q 192.168.10.1     # gateway only
# If gateway has no loss but 8.8.8.8 does: loss is upstream, not your fault
# If gateway has loss too: local link problem
# Step 3: Check NIC hardware errors
ip -s link show enp3s0
# RX: errors N dropped N missed N -- hardware errors suggest bad NIC/cable
ethtool -S enp3s0 | grep -E '(error|drop|crc|fifo|miss)'
# Step 4: Check kernel ring buffer drops (soft drops)
grep enp3s0 /proc/net/dev
# Columns include drop counts for RX and TX
# Step 5: Check for buffer overflows
sysctl net.core.netdev_max_backlog      # input queue depth
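Steps 1 and 2 can be combined into a small comparison: extract the loss percentage from each ping summary, then decide whether the loss is local or upstream. A sketch; loss_pct and localize_loss are hypothetical names, and the summary format is the iputils one shown in Step 1.

```shell
# Hypothetical helper: extract the packet loss percentage from ping output.
loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical helper: apply the Step 2 rule.
#   $1 = loss % to the gateway, $2 = loss % to the remote host
localize_loss() {
    gw=$1 remote=$2
    if awk -v g="$gw" 'BEGIN { exit !(g > 0) }'; then
        echo "local link"          # loss already present on the first hop
    elif awk -v r="$remote" 'BEGIN { exit !(r > 0) }'; then
        echo "upstream"            # gateway is clean, loss is further out
    else
        echo "no loss"
    fi
}

# Typical use:
#   gw=$(ping -c 100 -q 192.168.10.1 | loss_pct)
#   remote=$(ping -c 100 -q 8.8.8.8 | loss_pct)
#   localize_loss "$gw" "$remote"
```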
Slide 27 of 30  |  Scenario Lab
Full Scenario: Service Unreachable Investigation
A complete, ordered investigation from report to root cause.
Alert: sector-api.internal is returning "connection refused" to monitoring. The API team says their code is running. You have 5 minutes.
1. ss (listening?) -> 2. nc (connect?) -> 3. Bind 0.0.0.0? -> 4. FW (iptables) -> 5. Route (ip route) -> 6. pcap (tcpdump)
Systematic triage: each step narrows the failure domain
# 1. Is the service listening at all?
ss -tlnp | grep ':8080'
# If nothing: service crashed or wrong port in config
# 2. If listening, can we connect locally?
nc -zv 127.0.0.1 8080
# If no: firewall on loopback (unusual), or service only listening on a specific IP
# 3. What IP is the service bound to?
ss -tlnp | grep ':8080'
# 127.0.0.1:8080 = only accessible locally (not from outside!)
# 0.0.0.0:8080 = accessible on all interfaces
# 10.0.100.5:8080 = specific interface only
# 4. Firewall blocking?
iptables -L INPUT -n -v --line-numbers
nft list ruleset
# 5. Can the monitoring system actually route to us?
ip route get 10.0.200.1         # 10.0.200.1 = monitoring server
# 6. Packet capture: are the monitoring probes even arriving?
tcpdump -i enp3s0 -n -c 20 src 10.0.200.1
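Step 3 is the one most often misread under time pressure, so the interpretation of the local address column can be written down as a lookup. A sketch; bind_scope is a hypothetical name, and the wildcard forms *:PORT and [::]:PORT are included because ss prints them for some listeners.

```shell
# Hypothetical helper: classify the "Local Address:Port" value from ss -tln.
bind_scope() {
    case "$1" in
        127.*|"[::1]":*)            echo "loopback only" ;;
        0.0.0.0:*|"*":*|"[::]":*)   echo "all interfaces" ;;
        *)                          echo "specific address only" ;;
    esac
}

# Typical use: print the scope of whatever listens on port 8080.
#   ss -Htln sport = :8080 | awk '{print $4}' | while read -r addr; do
#       echo "$addr: $(bind_scope "$addr")"
#   done
```

A result of "loopback only" explains a remote "connection refused" even when the service is healthy and the firewall is open.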
Slide 28 of 30  |  Lab Exercises
Practice Exercises
Complete on your Ubuntu 22.04 lab VM before leaving the lab.
1 Use ss -tulnp to find every process listening on a TCP or UDP port. For each of the top 5 services you find, identify the binary using the PID and ps -p PID -o pid,comm,args. Document port, process name, and what that service does.
2 Use dig with +trace to trace the full DNS delegation chain for cloudflare.com. Identify each level: root server, TLD server, and authoritative server. Note the TTL values at each level.
3 Run tcpdump on your primary interface capturing only port 53 traffic. In another terminal, run dig google.com. Read the tcpdump output and identify the DNS query packet and response packet. Note the source and destination IPs and ports.
4 Perform an MTU path test to your default gateway: start at 1472 bytes with the DF bit set. If that works, try 8972. If 8972 fails, find the largest size that succeeds using binary search. Identify the effective path MTU.
5 Run nmap -sT -p 1-1024 127.0.0.1. Review every open port. For each open port, use ss -tlnp to identify the process. Look up any port you don't recognize and document whether it should be open.
Slide 29 of 30
What's Next
Week 1 complete. You have the operational baseline. Week 2 builds security on top of it.
Week 2: Firewall and Filtering
nftables (the modern iptables replacement), firewalld, ufw. Every port you found open with nmap today becomes a decision: should this be accessible? From where? Through what filter? Week 2 answers those questions operationally.
Week 3: DNS and DHCP Servers
Run Bind9 or Unbound for internal DNS. Run ISC-DHCP or Kea. The dig queries you ran today will be aimed at servers you built. Understanding the client-side (this module) makes building the server-side intuitive.
Week 4: Monitoring and Alerting
Prometheus node_exporter, Grafana, alert routing. The ss, sar, and ethtool data you collected manually today can be automated and alerted on. Week 4 makes your diagnostics continuous and proactive.
Slide 30 of 30  |  ALA-04
ALA-04 Summary: Key Takeaways
You now have a complete diagnostic toolkit for any network problem. The methodology matters as much as the tools: work from Layer 1 up, confirm each layer before ascending, and let the evidence guide you -- not assumptions. Every tool in this module produces ground truth. Trust the output.
1 ss -tulnp is your first command at any incident. It shows what is listening, on what port, and which process owns it. Run it before anything else.
2 ip route get <IP> shows which interface and gateway the kernel would use for that destination. This single command settles many routing questions immediately.
3 Stars (* * *) in traceroute do not mean the path is broken. They mean that hop does not reply to ICMP probes. If the next hop responds, routing is fine through that hop.
4 dig @server hostname queries a specific DNS server directly, bypassing your system resolver. Use this to distinguish between a broken record and a broken resolver.
5 tcpdump with no filter on a busy interface can consume significant CPU. Always use a filter (host X, port Y, proto Z). Always use -n to skip DNS resolution.
6 nc -zv host port tests TCP reachability, usually within a second when the port is open or closed; add -w N to bound the wait when it is filtered. Use it before launching tcpdump or nmap -- faster and less disruptive.
7 MTU black holes: small packets succeed, large packets hang. Test with ping -M do -s 1472. Fix with MSS clamping if ICMP Fragmentation Needed is being blocked.
8 nmap open = port reachable. nmap filtered = firewall blocking probes (may be open). nmap closed = host reachable, nothing listening. "filtered" does not mean "closed."