Identify common performance issues and their causes
Describe high availability concepts and implementations
Network Documentation
Documentation is the foundation of network operations. Without accurate documentation, troubleshooting takes longer, changes introduce risk, and knowledge walks out the door when staff leave. Every network needs both physical and logical diagrams.
Physical Diagrams: Show actual cable runs, rack locations, port assignments, and hardware placement. Layer 1 and Layer 2 views -- where things physically are and how they connect.
Logical Diagrams: Show IP addressing, VLANs, subnets, routing topology, and traffic flow. Layer 3 view -- how traffic moves through the network regardless of physical layout.
Additional documentation includes rack diagrams (U-positions of equipment), cable maps and wiring diagrams, and a complete asset inventory covering hardware models, software versions, licensing, and warranty expiration dates.
/* Key Documentation Types */
Rack Diagram      Equipment placement in each rack (U1-U42)
Cable Map         Port-to-port cable connections, labels, types
Asset Inventory   Hardware, software, licenses, warranty dates
IPAM              IP Address Management -- tracks all assigned IPs
Baseline          What "normal" looks like -- CPU, bandwidth, latency
/* Baselines are critical -- you can't identify abnormal
behavior if you don't know what normal looks like */
Change Management
Unauthorized or undocumented changes are a leading cause of network outages. Change management is a formal process that ensures every modification is reviewed, approved, tested, and documented before implementation.
/* The 7-Step Change Management Process */
1. Submit change request          // Describe what, why, and impact
2. Review and assess risk         // Classify as Low / Medium / High risk
3. Plan rollback procedure        // How to undo the change if it fails
4. Get CAB approval               // Change Advisory Board reviews and approves
5. Schedule maintenance window    // Minimize impact on users
6. Implement the change           // Execute during approved window
7. Verify and document            // Confirm success, update all documentation
Configuration management tracks device configurations over time. The running config is what the device is currently using. The golden config (or baseline config) is the approved, known-good configuration that the device should match.
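Checking a running config against the golden config comes down to a text comparison. Below is a minimal sketch using Python's standard difflib. The function name config_drift and both config snippets are hypothetical and only serve as an illustration; a real workflow would pull the running config from the device.

```python
import difflib


def config_drift(golden: str, running: str) -> list[str]:
    """Return lines that differ between golden and running configs.

    Lines prefixed with "- " exist only in the golden config;
    lines prefixed with "+ " exist only in the running config.
    """
    diff = difflib.ndiff(golden.splitlines(), running.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical configuration snippets, for illustration only.
GOLDEN = "hostname sw1\nsnmp-server group ops v3 priv\nntp server 10.0.0.1"
RUNNING = "hostname sw1\nsnmp-server community public RO\nntp server 10.0.0.1"

if __name__ == "__main__":
    for line in config_drift(GOLDEN, RUNNING):
        print(line)
```

An empty result means the device matches its golden config; any output is drift that should trace back to an approved change.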
Service Level Agreements (SLAs)
SLAs define contractual commitments: uptime guarantees (e.g., 99.999% = 5.26 minutes downtime/year), response time (how fast support acknowledges an issue), and resolution time (how fast the issue is fixed).
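The uptime figure above converts directly into an annual downtime budget. The sketch below shows the arithmetic, assuming an average year of 365.25 days; the function name allowed_downtime_minutes is illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes in an average year


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Return the downtime per year permitted by an uptime guarantee."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% uptime allows {allowed_downtime_minutes(pct):.2f} minutes/year")
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why the higher tiers cost far more to deliver.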
Network Monitoring -- SNMP
Simple Network Management Protocol (SNMP) is the standard for monitoring and managing network devices. It allows a central management station to query devices for performance data and receive alerts when problems occur.
/* SNMP Components */
Manager (NMS)   Network Management Station -- polls devices
Agent           Software on the managed device -- responds to queries
MIB             Management Information Base -- database of objects
/* SNMP Operations */
GET    Manager requests a value from the agent
SET    Manager changes a value on the agent
TRAP   Agent sends unsolicited alert to the manager
/* Ports */
UDP 161 -- Queries (GET/SET)
UDP 162 -- Traps
Version   Authentication      Encryption   Notes
SNMPv1    Community string    None         Legacy, insecure
SNMPv2c   Community string    None         Most common, still insecure
SNMPv3    Username/password   AES/DES      Recommended, secure
Traps vs Informs
Traps are fire-and-forget -- the agent sends the alert but never knows if the manager received it. Informs require acknowledgment from the manager, providing reliable notification at the cost of more overhead.
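The difference can be illustrated with a small simulation. This is not real SNMP: lossy_link, send_trap, and send_inform are hypothetical helpers that only model "send once" versus "retransmit until acknowledged", and the retry count is an assumed value.

```python
from typing import Callable


def lossy_link(drop_first: int):
    """Build a simulated link that loses the first `drop_first` messages."""
    state = {"sent": 0, "received": []}

    def deliver(message: str) -> bool:
        state["sent"] += 1
        if state["sent"] <= drop_first:
            return False  # lost in transit, so no acknowledgment comes back
        state["received"].append(message)
        return True  # manager received it and acknowledged

    return deliver, state


def send_trap(deliver: Callable[[str], bool], message: str) -> None:
    """Fire-and-forget: transmit once and ignore the outcome."""
    deliver(message)


def send_inform(deliver: Callable[[str], bool], message: str,
                max_retries: int = 3) -> tuple[bool, int]:
    """Retransmit until acknowledged. Returns (acknowledged, attempts used)."""
    for attempt in range(1, max_retries + 2):  # first send plus retries
        if deliver(message):
            return True, attempt
    return False, max_retries + 1


if __name__ == "__main__":
    deliver, state = lossy_link(drop_first=1)
    send_trap(deliver, "linkDown")
    print("trap delivered:", bool(state["received"]))

    deliver, state = lossy_link(drop_first=2)
    print("inform result:", send_inform(deliver, "linkDown"))
```

The extra attempts in the inform case are the overhead mentioned above: more traffic and more state on the agent in exchange for knowing the alert arrived.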
Syslog
Syslog is a centralized logging protocol that collects log messages from network devices and servers. A syslog collector (or syslog server) aggregates these messages for analysis, alerting, and compliance. Default port is UDP 514.
Level   Keyword         Description
0       Emergency       System unusable
1       Alert           Immediate action needed
2       Critical        Critical conditions
3       Error           Error conditions
4       Warning         Warning conditions
5       Notice          Normal but significant
6       Informational   Informational messages
7       Debug           Debug-level messages
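On the wire, the severity level travels inside a priority (PRI) value at the start of each message. Per RFC 5424, PRI = facility x 8 + severity; that formula comes from the RFC rather than from the table above. The sketch below encodes and decodes PRI. The function names are illustrative.

```python
import re

SEVERITIES = [
    "Emergency", "Alert", "Critical", "Error",
    "Warning", "Notice", "Informational", "Debug",
]

_PRI_PATTERN = re.compile(r"^<(\d{1,3})>")


def encode_pri(facility: int, severity: int) -> int:
    """Combine a facility code and a severity level into a PRI value."""
    return facility * 8 + severity


def decode_pri(pri: int) -> tuple[int, str]:
    """Split a PRI value into (facility code, severity keyword)."""
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity]


def parse_pri(line: str) -> tuple[int, str]:
    """Read the leading <PRI> field of a raw syslog line."""
    match = _PRI_PATTERN.match(line)
    if match is None:
        raise ValueError("line does not start with a <PRI> field")
    return decode_pri(int(match.group(1)))


if __name__ == "__main__":
    # Facility 23 is local7; severity 5 is Notice.
    print(encode_pri(23, 5))
    print(parse_pri("<34>Oct 11 22:14:15 host su: example message"))
```

Filtering on the decoded severity (for example, forwarding only levels 0 through 4) is a common way to keep a collector from being flooded with informational and debug messages.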
SIEM (Security Information and Event Management) takes log aggregation further by correlating events across multiple devices, applying rules to detect patterns, and generating security alerts. A SIEM can identify attacks that no single device log would reveal on its own.
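A correlation rule can be sketched in a few lines. The example below flags any source IP with failed logins on several distinct devices, a pattern no single device log would show. The event fields (device, event, src_ip), the sample events, and the three-device threshold are all assumptions for illustration, not the schema of any particular SIEM.

```python
from collections import defaultdict


def correlate_failed_logins(events: list[dict], min_devices: int = 3) -> list[str]:
    """Return source IPs with failed logins on at least `min_devices` devices."""
    devices_by_source: dict[str, set] = defaultdict(set)
    for event in events:
        if event["event"] == "login_failed":
            devices_by_source[event["src_ip"]].add(event["device"])
    return sorted(
        ip for ip, devices in devices_by_source.items()
        if len(devices) >= min_devices
    )


# Hypothetical normalized log events from three devices.
EVENTS = [
    {"device": "fw1", "event": "login_failed", "src_ip": "203.0.113.9"},
    {"device": "sw2", "event": "login_failed", "src_ip": "203.0.113.9"},
    {"device": "rtr3", "event": "login_failed", "src_ip": "203.0.113.9"},
    {"device": "fw1", "event": "login_failed", "src_ip": "10.0.0.50"},
    {"device": "sw2", "event": "login_ok", "src_ip": "10.0.0.50"},
]

if __name__ == "__main__":
    print(correlate_failed_logins(EVENTS))
```

Each device here sees only one failed login, which looks harmless in isolation; the alert exists only because the events were aggregated first.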
Flow Data & Packet Capture
Understanding network traffic requires different tools depending on the level of detail needed. Flow data shows traffic patterns and trends. Packet capture provides full content for deep analysis.
NetFlow / sFlow / IPFIX: Traffic metadata -- who talked to whom, which ports, how much data. Great for identifying top talkers, bandwidth trends, and anomalies without capturing content.
Packet Capture: Full packet content captured with tools like Wireshark or tcpdump. Required for deep protocol analysis and troubleshooting application-layer issues.
/* Traffic Capture Methods */
Port Mirroring (SPAN)
  Switch copies traffic from one port to another for analysis
  Configured on the switch -- no additional hardware needed
  Can impact switch performance under heavy load
TAP (Test Access Point)
  Inline hardware device that copies traffic passively
  No impact on network performance
  More reliable than SPAN -- captures everything including errors
/* When to use each */
Flow data --> Trends, capacity planning, top talkers
Capture   --> Deep diagnosis, protocol analysis, forensics
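Finding top talkers from flow data is a simple aggregation over flow records. The sketch below assumes a simplified record with only four fields; real NetFlow, sFlow, and IPFIX exports carry many more. FlowRecord, top_talkers, and the sample flows are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class FlowRecord(NamedTuple):
    """Simplified flow metadata: who talked to whom, on which port, how much."""
    src: str
    dst: str
    dst_port: int
    octets: int


def top_talkers(flows: list[FlowRecord], n: int = 3) -> list[tuple[str, int]]:
    """Return the `n` source addresses that sent the most bytes."""
    totals: Counter = Counter()
    for flow in flows:
        totals[flow.src] += flow.octets
    return totals.most_common(n)


# Hypothetical flow records, for illustration only.
FLOWS = [
    FlowRecord("10.0.0.5", "10.0.1.10", 443, 1_200_000),
    FlowRecord("10.0.0.7", "10.0.1.10", 443, 300_000),
    FlowRecord("10.0.0.5", "10.0.1.20", 22, 800_000),
    FlowRecord("10.0.0.9", "8.8.8.8", 53, 4_000),
]

if __name__ == "__main__":
    for address, octets in top_talkers(FLOWS, n=2):
        print(address, octets)
```

No packet payload is involved at any point, which is why flow analysis scales to links where full capture would be impractical.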
Disaster Recovery
Disaster recovery (DR) planning defines how an organization recovers from catastrophic failures. Four key metrics measure recovery capability:
Metric   Full Name                    Meaning
RPO      Recovery Point Objective     Max acceptable data loss (measured in time)
RTO      Recovery Time Objective      Max acceptable downtime
MTTR     Mean Time To Repair          Average repair duration
MTBF     Mean Time Between Failures   Average uptime between failures
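MTBF and MTTR combine into a widely used steady-state availability estimate: MTBF / (MTBF + MTTR). The formula is standard reliability math rather than something stated in the table above, and the sample figures below are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate the fraction of time a system is up, from MTBF and MTTR."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical device: fails about every 999 hours, takes 1 hour to repair.
    print(f"{availability(999, 1):.3%}")
```

The formula shows two levers for higher availability: make failures rarer (raise MTBF) or make repairs faster (lower MTTR).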
/* DR Site Types (cost vs recovery speed) */
Hot Site    Fully operational duplicate -- immediate failover
            $$$$ Most expensive, fastest recovery
Warm Site   Hardware ready, data needs restoration
            $$$ Moderate cost, hours to recover
Cold Site   Empty facility, everything needs setup
            $$ Cheapest, days/weeks to recover
Backup Types: A full backup copies everything (longest backup, fastest restore). A differential backup copies changes since the last full backup. An incremental backup copies changes since the last backup of any type (fastest backup, slowest restore).
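The restore-speed difference follows from how many backup sets each scheme needs. The sketch below computes the restore chain for a chronological backup list, assuming a schedule that uses either differentials or incrementals after the last full backup, not both. The function name and day labels are illustrative.

```python
def restore_chain(backups: list[tuple[str, str]]) -> list[str]:
    """Return the backup labels needed to restore to the most recent backup.

    `backups` is a chronological list of (label, kind) pairs, where kind is
    "full", "differential", or "incremental". Raises ValueError if the list
    contains no full backup or mixes schemes after the last full.
    """
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    full_label = backups[last_full][0]
    later = backups[last_full + 1:]
    kinds = {kind for _, kind in later}

    if kinds <= {"differential"}:
        # Each differential holds everything since the full: only the newest matters.
        return [full_label] + [label for label, _ in later[-1:]]
    if kinds <= {"incremental"}:
        # Each incremental holds one step: every one is needed, in order.
        return [full_label] + [label for label, _ in later]
    raise ValueError("mixed differential/incremental schedules are outside this sketch")


if __name__ == "__main__":
    print(restore_chain([("Sun", "full"), ("Mon", "differential"), ("Tue", "differential")]))
    print(restore_chain([("Sun", "full"), ("Mon", "incremental"), ("Tue", "incremental")]))
```

A differential restore always needs two sets (the full plus the latest differential), while an incremental restore grows by one set for every backup since the full.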
DR Testing: A tabletop exercise is a discussion-based walkthrough. A simulation tests procedures without actual failover. A full failover test actually switches to the DR site to verify everything works.
High Availability
High availability (HA) eliminates single points of failure so that services remain operational even when components fail. The goal is to maximize uptime through redundancy at every layer.
Active-Active: Both systems handle traffic simultaneously. If one fails, the other absorbs the full load. Better resource utilization but more complex configuration.
Active-Passive: One system handles traffic while the standby monitors. On failure, the passive system takes over. Simpler but the standby is idle during normal operations.
/* High Availability Components */
NIC Teaming / Link Aggregation
  Combine multiple NICs into one logical link
  Provides redundancy and increased bandwidth (LACP)
Clustering
  Multiple servers act as one logical system
  If one node fails, others continue serving requests
Load Balancing
  Distributes traffic across multiple servers
  Algorithms: round-robin, least connections, weighted
Power Redundancy
  Dual power supplies (PSU) in each device
  UPS (Uninterruptible Power Supply) for short outages
  Generators for extended power failures
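Two of the load-balancing algorithms listed above can be sketched in a few lines. This is a simplified model, not how any specific load balancer is implemented: the class names and server labels are hypothetical, and health checks and weights are left out.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotating order."""

    def __init__(self, servers: list[str]):
        self._cycle = itertools.cycle(servers)

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnections:
    """Send each new connection to the server with the fewest active ones."""

    def __init__(self, servers: list[str]):
        self.active = {server: 0 for server in servers}

    def pick(self) -> str:
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        """Record that a connection to `server` has closed."""
        self.active[server] -= 1


if __name__ == "__main__":
    rr = RoundRobin(["a", "b", "c"])
    print([rr.pick() for _ in range(4)])

    lc = LeastConnections(["a", "b"])
    print(lc.pick(), lc.pick())
    lc.release("a")
    print(lc.pick())
```

Round-robin ignores load entirely, which works when requests are similar in cost; least connections adapts when some sessions last much longer than others.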
Performance Monitoring
Effective network operations require continuous monitoring of key performance metrics. Understanding these metrics helps identify issues before they impact users and validates that the network meets its SLAs.
Metric        What It Measures                     Why It Matters
Latency       Round-trip time (measured by ping)   High latency degrades user experience
Jitter        Variation in latency over time       Critical for VoIP and video -- causes choppy audio
Bandwidth     Maximum theoretical capacity         The pipe size -- 1 Gbps link capacity
Throughput    Actual data transfer rate achieved   Always less than bandwidth due to overhead
Packet Loss   Percentage of dropped packets        Even 1-2% causes noticeable degradation
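Jitter and packet loss can both be derived from simple probe results. The sketch below assumes jitter is defined as the mean absolute difference between consecutive latency samples, which is one common simplification; RTP implementations use a smoothed estimator instead. Function names and sample values are illustrative.

```python
from statistics import mean


def jitter_ms(latencies_ms: list[float]) -> float:
    """Mean absolute change between consecutive latency samples."""
    if len(latencies_ms) < 2:
        return 0.0
    return mean(abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:]))


def packet_loss_percent(sent: int, received: int) -> float:
    """Percentage of sent packets that never arrived."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100 * (sent - received) / sent


if __name__ == "__main__":
    samples = [20, 24, 22, 28]  # hypothetical ping round-trip times in ms
    print("jitter:", jitter_ms(samples), "ms")
    print("loss:", packet_loss_percent(sent=100, received=98), "%")
```

Two links with the same average latency can have very different jitter, which is why voice and video quality tracks jitter more closely than raw latency.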
/* QoS (Quality of Service) */
DSCP Marking       Tags packets with priority levels
Priority Queuing   High-priority traffic (VoIP) sent first
Traffic Shaping    Smooths out bursty traffic to match link capacity
Congestion Mgmt    Decides which packets to drop when overloaded
/* Anomaly Alerting */
Set thresholds based on baseline documentation
Alert when metrics deviate from normal ranges
Examples: CPU > 90%, link utilization > 80%, packet loss > 1%
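The example thresholds above translate into a straightforward check. The sketch below compares one metrics sample against fixed limits. The metric names and the sample are hypothetical, and in practice the limits would be tuned from the network's own baseline rather than hard-coded.

```python
# Limits taken from the examples above; keys are illustrative metric names.
THRESHOLDS = {
    "cpu_percent": 90,
    "link_utilization_percent": 80,
    "packet_loss_percent": 1,
}


def check_thresholds(sample: dict[str, float],
                     thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return one alert message per metric that exceeds its threshold."""
    return [
        f"{name} = {value} exceeds threshold {thresholds[name]}"
        for name, value in sample.items()
        if name in thresholds and value > thresholds[name]
    ]


if __name__ == "__main__":
    # Hypothetical polled values for one device.
    sample = {"cpu_percent": 95, "link_utilization_percent": 60, "packet_loss_percent": 1.5}
    for alert in check_thresholds(sample):
        print(alert)
```

Metrics without a configured threshold are ignored rather than treated as errors, so new metrics can be collected before anyone has decided what "abnormal" means for them.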