NE-10: Network Operations & Monitoring

Learning Objectives

Explain the purpose of network documentation and change management
Describe network monitoring technologies (SNMP, syslog, SIEM)
Compare disaster recovery concepts (RPO, RTO, MTTR, MTBF)
Identify common performance issues and their causes
Describe high availability concepts and implementations

Network Documentation

Documentation is the foundation of network operations. Without accurate documentation, troubleshooting takes longer, changes introduce risk, and knowledge walks out the door when staff leave. Every network needs both physical and logical diagrams.

Physical Diagrams: Show actual cable runs, rack locations, port assignments, and hardware placement. Layer 1 and Layer 2 views -- where things physically are and how they connect.

Logical Diagrams: Show IP addressing, VLANs, subnets, routing topology, and traffic flow. Layer 3 view -- how traffic moves through the network regardless of physical layout.

Additional documentation includes rack diagrams (U-positions of equipment), cable maps and wiring diagrams, and a complete asset inventory covering hardware models, software versions, licensing, and warranty expiration dates.

/* Key Documentation Types */
Rack Diagram      Equipment placement in each rack (U1-U42)
Cable Map         Port-to-port cable connections, labels, types
Asset Inventory   Hardware, software, licenses, warranty dates
IPAM              IP Address Management -- tracks all assigned IPs
Baseline          What "normal" looks like -- CPU, bandwidth, latency

/* Baselines are critical -- you can't identify abnormal behavior if you don't know what normal looks like */
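To make the baseline idea concrete, here is a minimal sketch that derives a "normal" range from recorded samples and flags readings that fall far outside it. The sample values, the function name `is_abnormal`, and the choice of three standard deviations are illustrative assumptions, not a recommended standard.

```python
from statistics import mean, pstdev

# Hypothetical CPU utilization samples (percent) recorded during normal operation.
baseline_cpu = [30, 32, 28, 31, 29]

def is_abnormal(value: float, baseline: list[float], k: float = 3.0) -> bool:
    """Return True if value is more than k standard deviations from the baseline mean."""
    center = mean(baseline)
    spread = pstdev(baseline)
    return abs(value - center) > k * spread

print(is_abnormal(31, baseline_cpu))  # close to the baseline mean
print(is_abnormal(45, baseline_cpu))  # far above anything in the baseline
```

Without the recorded baseline there would be nothing to compare the reading of 45 against, which is the point the note above makes.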

Change Management

Unauthorized or undocumented changes are a leading cause of network outages. Change management is a formal process that ensures every modification is reviewed, approved, tested, and documented before implementation.

/* The 7-Step Change Management Process */
1. Submit change request         // Describe what, why, and impact
2. Review and assess risk        // Classify as Low / Medium / High risk
3. Plan rollback procedure       // How to undo the change if it fails
4. Get CAB approval              // Change Advisory Board reviews and approves
5. Schedule maintenance window   // Minimize impact on users
6. Implement the change          // Execute during approved window
7. Verify and document           // Confirm success, update all documentation

Configuration management tracks device configurations over time. The running config is what the device is currently using. The golden config (or baseline config) is the approved, known-good configuration that the device should match.
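Comparing the running config against the golden config amounts to a text diff. The sketch below uses Python's standard `difflib` module. The two config snippets are invented for illustration and do not come from any particular device or vendor.

```python
import difflib

def config_drift(golden: str, running: str) -> list[str]:
    """Return unified-diff lines showing how the running config differs from golden."""
    diff = difflib.unified_diff(
        golden.splitlines(),
        running.splitlines(),
        fromfile="golden",
        tofile="running",
        lineterm="",
    )
    return list(diff)

# Hypothetical configs: someone replaced the approved SNMP line with a community string.
golden_config = """hostname core-sw1
snmp-server group OPS v3 priv
logging host 192.0.2.10"""

running_config = """hostname core-sw1
snmp-server community public RO
logging host 192.0.2.10"""

drift = config_drift(golden_config, running_config)
for line in drift:
    print(line)
```

An empty result means the device matches its golden config. Any `-` or `+` lines point at a change that should trace back to an approved change request.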

Service Level Agreements (SLAs)

SLAs define contractual commitments: uptime guarantees (e.g., 99.999% = 5.26 minutes downtime/year), response time (how fast support acknowledges an issue), and resolution time (how fast the issue is fixed).
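The downtime figure follows directly from the uptime percentage. A small sketch of the arithmetic, assuming a 365.25-day year (which is what yields 5.26 minutes for "five nines"):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, averaging in leap years

def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Maximum downtime per year, in minutes, permitted by an uptime guarantee."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {allowed_downtime_minutes(pct):.2f} min/year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why higher SLA tiers are much harder to meet.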

Network Monitoring -- SNMP

Simple Network Management Protocol (SNMP) is the standard for monitoring and managing network devices. It allows a central management station to query devices for performance data and receive alerts when problems occur.

/* SNMP Components */
Manager (NMS)   Network Management Station -- polls devices
Agent           Software on the managed device -- responds to queries
MIB             Management Information Base -- database of objects

/* SNMP Operations */
GET    Manager requests a value from the agent
SET    Manager changes a value on the agent
TRAP   Agent sends unsolicited alert to the manager

/* Ports */
UDP 161 -- Queries (GET/SET)
UDP 162 -- Traps

Version	Authentication	Encryption	Notes
SNMPv1	Community string	None	Legacy, insecure
SNMPv2c	Community string	None	Most common, still insecure
SNMPv3	Username and password (MD5/SHA)	AES/DES	Recommended, secure (authPriv mode)

Traps vs Informs

Traps are fire-and-forget -- the agent sends the alert but never knows if the manager received it. Informs require acknowledgment from the manager, providing reliable notification at the cost of more overhead.

Syslog

Syslog is a logging protocol that lets network devices and servers send their log messages to a central location. A syslog collector (or syslog server) aggregates these messages for analysis, alerting, and compliance. The default port is UDP 514.

Level	Keyword	Description
0	Emergency	System unusable
1	Alert	Immediate action needed
2	Critical	Critical conditions
3	Error	Error conditions
4	Warning	Warning conditions
5	Notice	Normal but significant
6	Informational	Informational messages
7	Debug	Debug-level messages
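On the wire, a syslog message begins with a priority field such as `<34>`. The severity level from the table above is packed into that number together with a facility code (not covered in this lesson) as `PRI = facility * 8 + severity`. The sketch below decodes that field. The function name and the sample message are illustrative.

```python
import re

# Index in this list equals the severity level from the table above.
SEVERITY_KEYWORDS = [
    "Emergency", "Alert", "Critical", "Error",
    "Warning", "Notice", "Informational", "Debug",
]

PRI_PATTERN = re.compile(r"^<(\d{1,3})>")

def decode_priority(message: str) -> tuple[int, int, str]:
    """Return (facility, severity, keyword) from a message starting with <PRI>."""
    match = PRI_PATTERN.match(message)
    if match is None:
        raise ValueError("message does not start with a <PRI> field")
    pri = int(match.group(1))
    if pri > 191:  # 23 * 8 + 7 is the largest valid value
        raise ValueError(f"PRI value {pri} is out of range")
    facility, severity = divmod(pri, 8)
    return facility, severity, SEVERITY_KEYWORDS[severity]

print(decode_priority("<34>su: authentication failure"))
```

Because lower numbers are more severe, a filter such as "severity <= 3" selects Error and everything more urgent.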

SIEM (Security Information and Event Management) takes log aggregation further by correlating events across multiple devices, applying rules to detect patterns, and generating security alerts. A SIEM can identify attacks that no single device log would reveal on its own.
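As a rough illustration of correlation, the sketch below flags a source IP that fails to log in on several different devices within a short window. Each device's own log shows only one or two failures, so the pattern is only visible when the logs are combined. The `Event` fields, the rule itself, and the thresholds are hypothetical; real SIEM products use their own rule languages.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some reference point
    device: str       # device that logged the event
    source_ip: str
    action: str       # e.g. "login_failed"

def correlate_failed_logins(events, min_devices=3, window_seconds=60):
    """Return source IPs with failed logins on >= min_devices distinct devices in the window."""
    by_source = defaultdict(list)
    for event in events:
        if event.action == "login_failed":
            by_source[event.source_ip].append(event)

    flagged = set()
    for source, evs in by_source.items():
        evs.sort(key=lambda e: e.timestamp)
        start = 0
        for end in range(len(evs)):
            # Slide the window start forward until it spans at most window_seconds.
            while evs[end].timestamp - evs[start].timestamp > window_seconds:
                start += 1
            devices = {e.device for e in evs[start:end + 1]}
            if len(devices) >= min_devices:
                flagged.add(source)
                break
    return flagged

events = [
    Event(0, "fw1", "203.0.113.5", "login_failed"),
    Event(10, "sw1", "203.0.113.5", "login_failed"),
    Event(20, "rtr1", "203.0.113.5", "login_failed"),
    Event(5, "fw1", "198.51.100.7", "login_failed"),
    Event(300, "sw1", "198.51.100.7", "login_failed"),
]
print(correlate_failed_logins(events))
```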

Flow Data & Packet Capture

Understanding network traffic requires different tools depending on the level of detail needed. Flow data shows traffic patterns and trends. Packet capture provides full content for deep analysis.

NetFlow / sFlow / IPFIX: Traffic metadata -- who talked to whom, which ports, how much data. Great for identifying top talkers, bandwidth trends, and anomalies without capturing content.

Packet Capture: Full packet content captured with tools like Wireshark or tcpdump. Required for deep protocol analysis and troubleshooting application-layer issues.

/* Traffic Capture Methods */
Port Mirroring (SPAN)
  Switch copies traffic from one port to another for analysis
  Configured on the switch -- no additional hardware needed
  Can impact switch performance under heavy load
TAP (Test Access Point)
  Inline hardware device that copies traffic passively
  No impact on network performance
  More reliable than SPAN -- captures everything including errors

/* When to use each */
Flow data --> Trends, capacity planning, top talkers
Capture   --> Deep diagnosis, protocol analysis, forensics
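Because flow data is only metadata, finding top talkers is a simple aggregation. The sketch below sums bytes per source address. The record layout is a simplified assumption and not the actual NetFlow, sFlow, or IPFIX export format.

```python
from collections import Counter

# Hypothetical, simplified flow records (real exports carry many more fields).
flow_records = [
    {"src": "10.0.0.5", "dst": "10.0.1.20", "dst_port": 443, "bytes": 1_200_000},
    {"src": "10.0.0.9", "dst": "10.0.1.20", "dst_port": 443, "bytes": 300_000},
    {"src": "10.0.0.5", "dst": "10.0.1.30", "dst_port": 22, "bytes": 800_000},
    {"src": "10.0.0.7", "dst": "10.0.1.20", "dst_port": 53, "bytes": 50_000},
]

def top_talkers(records, n=3):
    """Return the n source addresses that sent the most bytes, largest first."""
    totals = Counter()
    for record in records:
        totals[record["src"]] += record["bytes"]
    return totals.most_common(n)

for address, total_bytes in top_talkers(flow_records, n=2):
    print(f"{address}: {total_bytes} bytes")
```

Note that nothing here looks at packet contents. Answering what 10.0.0.5 actually sent would require a packet capture.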

Disaster Recovery

Disaster recovery (DR) planning defines how an organization recovers from catastrophic failures. Four key metrics measure recovery capability:

Metric	Full Name	Meaning
RPO	Recovery Point Objective	Max acceptable data loss (measured in time)
RTO	Recovery Time Objective	Max acceptable downtime
MTTR	Mean Time To Repair	Average repair duration
MTBF	Mean Time Between Failures	Average uptime between failures
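Two quick calculations tie these metrics together. MTBF and MTTR combine into a steady-state availability estimate using the standard formula MTBF / (MTBF + MTTR), and a backup schedule can be checked against an RPO by treating the backup interval as the worst-case data loss. The numbers and function names below are illustrative assumptions.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is expected to be up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """Worst-case data loss is roughly one backup interval, so it must not exceed the RPO."""
    return backup_interval_hours <= rpo_hours

# Hypothetical device: fails on average every 2,000 hours and takes 4 hours to repair.
print(f"Availability: {availability(2000, 4):.4%}")

# Hypothetical requirement: lose no more than 4 hours of data.
print("Nightly backups meet a 4-hour RPO:", meets_rpo(24, 4))
print("Hourly backups meet a 4-hour RPO:", meets_rpo(1, 4))
```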

/* DR Site Types (cost vs recovery speed) */
Hot Site    Fully operational duplicate -- immediate failover
            $$$$ Most expensive, fastest recovery
Warm Site   Hardware ready, data needs restoration
            $$$ Moderate cost, hours to recover
Cold Site   Empty facility, everything needs setup
            $$ Cheapest, days/weeks to recover

Backup Types
Full: copies everything (longest backup, fastest restore).
Differential: copies changes since the last full backup.
Incremental: copies changes since the last backup of any type (fastest backup, slowest restore).
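The difference in restore speed comes from how many backup sets must be applied. The sketch below works out the restore chain for a list of backups ordered oldest to newest. The function name, labels, and schedules are hypothetical, and the code assumes the list contains at least one full backup.

```python
def restore_chain(backups: list[tuple[str, str]]) -> list[str]:
    """Return the labels needed, in order, to restore to the most recent backup.

    backups is ordered oldest to newest; each entry is (label, kind) where kind
    is "full", "differential", or "incremental".
    """
    # Start from the most recent full backup (raises ValueError if there is none).
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full][0]]
    later = backups[last_full + 1:]

    # A differential holds everything since the full, so only the latest one matters.
    differential_positions = [i for i, (_, kind) in enumerate(later) if kind == "differential"]
    start = 0
    if differential_positions:
        last_differential = differential_positions[-1]
        chain.append(later[last_differential][0])
        start = last_differential + 1

    # Every incremental after that point is required, in order.
    chain.extend(label for label, kind in later[start:] if kind == "incremental")
    return chain

incremental_week = [("Sun", "full"), ("Mon", "incremental"), ("Tue", "incremental"), ("Wed", "incremental")]
differential_week = [("Sun", "full"), ("Mon", "differential"), ("Tue", "differential"), ("Wed", "differential")]

print(restore_chain(incremental_week))
print(restore_chain(differential_week))
```

With incrementals, every set since the full backup has to be restored. With differentials, only the full backup and the latest differential are needed.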

DR Testing
Tabletop exercise: discussion-based walkthrough.
Simulation: test procedures without actual failover.
Full failover test: actually switch to the DR site to verify everything works.

High Availability

High availability (HA) eliminates single points of failure so that services remain operational even when components fail. The goal is to maximize uptime through redundancy at every layer.

Active-Active: Both systems handle traffic simultaneously. If one fails, the other absorbs the full load. Better resource utilization but more complex configuration.

Active-Passive: One system handles traffic while the standby monitors the active system. On failure, the passive system takes over. Simpler, but the standby is idle during normal operations.

/* High Availability Components */
NIC Teaming / Link Aggregation
  Combine multiple NICs into one logical link
  Provides redundancy and increased bandwidth (LACP)
Clustering
  Multiple servers act as one logical system
  If one node fails, others continue serving requests
Load Balancing
  Distributes traffic across multiple servers
  Algorithms: round-robin, least connections, weighted
Power Redundancy
  Dual power supplies (PSU) in each device
  UPS (Uninterruptible Power Supply) for short outages
  Generators for extended power failures
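Two of the load-balancing algorithms named above can be sketched in a few lines. These classes are simplified illustrations with hypothetical names. A production load balancer also handles health checks, weights, and concurrency.

```python
import itertools

class RoundRobinBalancer:
    """Hand out servers in a fixed rotating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Send each new connection to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a connection to the server closes."""
        self.active[server] -= 1

rr = RoundRobinBalancer(["web1", "web2", "web3"])
print([rr.pick() for _ in range(4)])

lc = LeastConnectionsBalancer(["web1", "web2", "web3"])
first_three = [lc.pick() for _ in range(3)]
lc.release("web2")   # web2 finishes its connection first
print(first_three, lc.pick())
```

Round-robin ignores how busy each server is, while least connections adapts when some requests take longer than others.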

Performance Monitoring

Effective network operations require continuous monitoring of key performance metrics. Understanding these metrics helps identify issues before they impact users and validates that the network meets its SLAs.

Metric	What It Measures	Why It Matters
Latency	Round-trip time (measured by ping)	High latency degrades user experience
Jitter	Variation in latency over time	Critical for VoIP and video -- causes choppy audio
Bandwidth	Maximum theoretical capacity	The size of the pipe, e.g., a 1 Gbps link
Throughput	Actual data transfer rate achieved	Always less than bandwidth due to overhead
Packet Loss	Percentage of dropped packets	Even 1-2% causes noticeable degradation
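Jitter and packet loss can both be computed from a series of ping-style probes. This sketch treats jitter as the mean absolute difference between consecutive latency samples, which is one common simplification (RTP, for example, uses a smoothed estimator instead). The sample values are invented.

```python
from statistics import mean

def jitter_ms(latencies_ms: list[float]) -> float:
    """Mean absolute difference between consecutive latency samples."""
    differences = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return mean(differences) if differences else 0.0

def packet_loss_percent(sent: int, received: int) -> float:
    """Share of probes that never got a reply, as a percentage."""
    return 100 * (sent - received) / sent

samples = [20, 22, 19, 30, 21]  # hypothetical round-trip times in milliseconds
print(f"Average latency: {mean(samples)} ms")
print(f"Jitter: {jitter_ms(samples)} ms")
print(f"Packet loss: {packet_loss_percent(200, 197)}%")
```

Two links with the same average latency can have very different jitter, which is why voice and video quality depends on both.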

/* QoS (Quality of Service) */
DSCP Marking       Tags packets with priority levels
Priority Queuing   High-priority traffic (VoIP) sent first
Traffic Shaping    Smooths out bursty traffic to match link capacity
Congestion Mgmt    Decides which packets to drop when overloaded

/* Anomaly Alerting */
Set thresholds based on baseline documentation
Alert when metrics deviate from normal ranges
Examples: CPU > 90%, link utilization > 80%, packet loss > 1%
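Threshold alerting reduces to comparing each reading with its limit. The sketch below uses the three example thresholds listed above. The metric names and the sample reading are hypothetical.

```python
# Limits taken from the examples above; metric names are illustrative.
THRESHOLDS = {
    "cpu_percent": 90,
    "link_utilization_percent": 80,
    "packet_loss_percent": 1,
}

def check_thresholds(sample: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return one alert message for every metric in sample that exceeds its threshold."""
    return [
        f"{name} = {value} exceeds threshold {thresholds[name]}"
        for name, value in sample.items()
        if name in thresholds and value > thresholds[name]
    ]

reading = {"cpu_percent": 95, "link_utilization_percent": 60, "packet_loss_percent": 2.5}
for alert in check_thresholds(reading):
    print("ALERT:", alert)
```

In practice the limits would come from the documented baseline for each device rather than a single table of fixed values.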

Key Takeaways

Document everything -- physical diagrams, logical diagrams, baselines
NEVER make changes without a change management process
SNMPv3 is the only version with authentication and encryption -- v1/v2c send community strings in cleartext
Syslog severity 0 = most critical, 7 = least
RPO = how much data you can lose, RTO = how long you can be down
Hot sites cost the most but recover the fastest
Full backups take longest but restore fastest