NE-10

Network Operations & Monitoring


Learning Objectives

Network Documentation

Documentation is the foundation of network operations. Without accurate documentation, troubleshooting takes longer, changes introduce risk, and knowledge walks out the door when staff leave. Every network needs both physical and logical diagrams.

Physical Diagrams: Show actual cable runs, rack locations, port assignments, and hardware placement. Layer 1 and Layer 2 views -- where things physically are and how they connect.
Logical Diagrams: Show IP addressing, VLANs, subnets, routing topology, and traffic flow. Layer 3 view -- how traffic moves through the network regardless of physical layout.

Additional documentation includes rack diagrams (U-positions of equipment), cable maps and wiring diagrams, and a complete asset inventory covering hardware models, software versions, licensing, and warranty expiration dates.

/* Key Documentation Types */
Rack Diagram     Equipment placement in each rack (U1-U42)
Cable Map        Port-to-port cable connections, labels, types
Asset Inventory  Hardware, software, licenses, warranty dates
IPAM             IP Address Management -- tracks all assigned IPs
Baseline         What "normal" looks like -- CPU, bandwidth, latency

/* Baselines are critical -- you can't identify abnormal behavior if you don't know what normal looks like */

Change Management

Unauthorized or undocumented changes are a leading cause of network outages. Change management is a formal process that ensures every modification is reviewed, approved, tested, and documented before implementation.

/* The 7-Step Change Management Process */
1. Submit change request         // Describe what, why, and impact
2. Review and assess risk        // Classify as Low / Medium / High risk
3. Plan rollback procedure       // How to undo the change if it fails
4. Get CAB approval              // Change Advisory Board reviews and approves
5. Schedule maintenance window   // Minimize impact on users
6. Implement the change          // Execute during approved window
7. Verify and document           // Confirm success, update all documentation

Configuration management tracks device configurations over time. The running config is what the device is currently using. The golden config (or baseline config) is the approved, known-good configuration that the device should match.
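Comparing the running config against the golden config is essentially a text diff. The following is a minimal sketch, assuming both configurations are available as plain text. The function name `config_drift` and the sample config lines are hypothetical and chosen only for illustration. It relies on Python's standard `difflib` module.

```python
import difflib


def config_drift(golden: str, running: str) -> list[str]:
    """Return the lines removed from (-) or added to (+) the golden config."""
    diff = list(
        difflib.unified_diff(
            golden.splitlines(),
            running.splitlines(),
            fromfile="golden",
            tofile="running",
            lineterm="",
        )
    )
    # The first two diff lines are the "--- golden" / "+++ running" headers.
    body = diff[2:]
    # Keep only changed lines; skip "@@" hunk markers and unchanged context.
    return [line for line in body if line.startswith(("+", "-"))]


# Hypothetical sample configs, for illustration only.
GOLDEN = """hostname core-sw1
snmp-server group ops v3 priv
logging host 10.0.0.50"""

RUNNING = """hostname core-sw1
snmp-server community public RO
logging host 10.0.0.50"""

if __name__ == "__main__":
    for line in config_drift(GOLDEN, RUNNING):
        print(line)
```

An empty result means the device matches its golden config. Any other output is drift that should be traced back to an approved change request.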

Service Level Agreements (SLAs)

SLAs define contractual commitments: uptime guarantees (e.g., 99.999% = 5.26 minutes downtime/year), response time (how fast support acknowledges an issue), and resolution time (how fast the issue is fixed).
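The downtime figure follows directly from the uptime percentage. A minimal sketch of the arithmetic, assuming a 365.25-day average year (the assumption that reproduces the 5.26-minute figure); the function name is illustrative:

```python
# Average year length in minutes, counting leap years (assumption).
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Maximum yearly downtime, in minutes, permitted by an uptime guarantee."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for sla in (99.9, 99.99, 99.999):
        print(f"{sla}% uptime allows {allowed_downtime_minutes(sla):.2f} min/year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why five-nines commitments usually require redundant hardware rather than faster repairs alone.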

Network Monitoring -- SNMP

Simple Network Management Protocol (SNMP) is the standard for monitoring and managing network devices. It allows a central management station to query devices for performance data and receive alerts when problems occur.

/* SNMP Components */
Manager (NMS)  Network Management Station -- polls devices
Agent          Software on the managed device -- responds to queries
MIB            Management Information Base -- database of objects

/* SNMP Operations */
GET   Manager requests a value from the agent
SET   Manager changes a value on the agent
TRAP  Agent sends unsolicited alert to the manager

/* Ports */
UDP 161 -- Queries (GET/SET)
UDP 162 -- Traps

| Version | Authentication | Encryption | Notes |
|---|---|---|---|
| SNMPv1 | Community string | None | Legacy, insecure |
| SNMPv2c | Community string | None | Most common, still insecure |
| SNMPv3 | Username/password | AES/DES | Recommended, secure |

Traps vs Informs

Traps are fire-and-forget -- the agent sends the alert but never knows if the manager received it. Informs require acknowledgment from the manager, providing reliable notification at the cost of more overhead.

Syslog

Syslog is a centralized logging protocol that collects log messages from network devices and servers. A syslog collector (or syslog server) aggregates these messages for analysis, alerting, and compliance. Default port is UDP 514.

| Level | Keyword | Description |
|---|---|---|
| 0 | Emergency | System unusable |
| 1 | Alert | Immediate action needed |
| 2 | Critical | Critical conditions |
| 3 | Error | Error conditions |
| 4 | Warning | Warning conditions |
| 5 | Notice | Normal but significant |
| 6 | Informational | Informational messages |
| 7 | Debug | Debug-level messages |
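On the wire, the severity level travels inside the priority (PRI) value at the start of each syslog message. The text above does not cover this detail: per the syslog RFCs, PRI = facility x 8 + severity. The sketch below decodes it; the function name and sample messages are hypothetical.

```python
import re

# Severity keywords indexed by level, matching the table above.
SEVERITIES = [
    "Emergency", "Alert", "Critical", "Error",
    "Warning", "Notice", "Informational", "Debug",
]


def decode_pri(message: str) -> tuple[int, int, str]:
    """Split the leading <PRI> of a syslog message into facility and severity."""
    match = re.match(r"<(\d{1,3})>", message)
    if match is None:
        raise ValueError("message does not start with a <PRI> field")
    pri = int(match.group(1))
    if pri > 191:  # highest valid value: facility 23, severity 7
        raise ValueError(f"PRI out of range: {pri}")
    facility, severity = divmod(pri, 8)
    return facility, severity, SEVERITIES[severity]


if __name__ == "__main__":
    # Hypothetical sample messages.
    print(decode_pri("<34>su: authentication failure for admin"))
    print(decode_pri("<189>Interface Gi0/1 changed state to up"))
```

A collector can filter on the decoded severity, for example forwarding only levels 0 through 4 to the on-call alerting channel.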

SIEM (Security Information and Event Management) takes log aggregation further by correlating events across multiple devices, applying rules to detect patterns, and generating security alerts. A SIEM can identify attacks that no single device log would reveal on its own.
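To illustrate what correlation means here, consider a toy rule: flag any source address that fails logins on several different devices. This is only a sketch of the idea, not how any particular SIEM product works. The event fields (`type`, `src`, `device`) and the threshold are assumptions made for the example.

```python
from collections import defaultdict


def sources_hitting_many_devices(events, min_devices=3):
    """Return source IPs with failed logins on at least min_devices devices."""
    devices_by_source = defaultdict(set)
    for event in events:
        if event["type"] == "login_failed":
            devices_by_source[event["src"]].add(event["device"])
    return sorted(
        src for src, devices in devices_by_source.items()
        if len(devices) >= min_devices
    )


# Hypothetical, already-parsed log events from three devices.
EVENTS = [
    {"type": "login_failed", "src": "203.0.113.9", "device": "fw1"},
    {"type": "login_failed", "src": "203.0.113.9", "device": "sw1"},
    {"type": "login_failed", "src": "10.0.0.5", "device": "fw1"},
    {"type": "login_ok", "src": "10.0.0.5", "device": "fw1"},
    {"type": "login_failed", "src": "203.0.113.9", "device": "vpn1"},
]

if __name__ == "__main__":
    print(sources_hitting_many_devices(EVENTS))
```

Each device on its own sees a single failed login, which looks harmless. Only the aggregated view shows one source probing three devices.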

Flow Data & Packet Capture

Understanding network traffic requires different tools depending on the level of detail needed. Flow data shows traffic patterns and trends. Packet capture provides full content for deep analysis.

NetFlow / sFlow / IPFIX: Traffic metadata -- who talked to whom, which ports, how much data. Great for identifying top talkers, bandwidth trends, and anomalies without capturing content.
Packet Capture: Full packet content captured with tools like Wireshark or tcpdump. Required for deep protocol analysis and troubleshooting application-layer issues.

/* Traffic Capture Methods */
Port Mirroring (SPAN)
  Switch copies traffic from one port to another for analysis
  Configured on the switch -- no additional hardware needed
  Can impact switch performance under heavy load
TAP (Test Access Point)
  Inline hardware device that copies traffic passively
  No impact on network performance
  More reliable than SPAN -- captures everything including errors

/* When to use each */
Flow data --> Trends, capacity planning, top talkers
Capture   --> Deep diagnosis, protocol analysis, forensics
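Finding top talkers from flow data amounts to summing bytes per source. A minimal sketch, assuming flow records have already been exported and parsed into dictionaries; the field names `src`, `dst`, and `bytes` are hypothetical and do not reflect an actual NetFlow or IPFIX schema.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Return the n sources that sent the most bytes, largest first."""
    bytes_by_source = Counter()
    for flow in flows:
        bytes_by_source[flow["src"]] += flow["bytes"]
    return bytes_by_source.most_common(n)


# Hypothetical flow records.
FLOWS = [
    {"src": "10.0.0.5", "dst": "10.0.1.20", "bytes": 1_200},
    {"src": "10.0.0.7", "dst": "10.0.1.20", "bytes": 90_000},
    {"src": "10.0.0.5", "dst": "10.0.1.30", "bytes": 4_000},
    {"src": "10.0.0.9", "dst": "10.0.1.20", "bytes": 300},
]

if __name__ == "__main__":
    for src, total in top_talkers(FLOWS, n=2):
        print(src, total)
```

Only metadata is needed here, which is why flow data scales to trend analysis where full packet capture would not.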

Disaster Recovery

Disaster recovery (DR) planning defines how an organization recovers from catastrophic failures. Four key metrics measure recovery capability:

| Metric | Full Name | Meaning |
|---|---|---|
| RPO | Recovery Point Objective | Max acceptable data loss (measured in time) |
| RTO | Recovery Time Objective | Max acceptable downtime |
| MTTR | Mean Time To Repair | Average repair duration |
| MTBF | Mean Time Between Failures | Average uptime between failures |
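MTBF and MTTR can be combined into an availability estimate. The relation below is the commonly used steady-state approximation, availability = MTBF / (MTBF + MTTR). It is not stated in the text above and is offered only as a sketch, with illustrative numbers.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate the fraction of time a component is up from MTBF and MTTR."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Illustrative figures: a device that fails about every 999 hours
    # and takes 1 hour to repair.
    print(f"{availability(999, 1):.3%}")
```

The formula shows two ways to raise availability: make failures rarer (higher MTBF) or make repairs faster (lower MTTR).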

/* DR Site Types (cost vs recovery speed) */
Hot Site   Fully operational duplicate -- immediate failover
           $$$$  Most expensive, fastest recovery
Warm Site  Hardware ready, data needs restoration
           $$$   Moderate cost, hours to recover
Cold Site  Empty facility, everything needs setup
           $$    Cheapest, days/weeks to recover

Backup Types
Full: copies everything (longest backup, fastest restore).
Differential: copies changes since the last full backup (restore needs the full plus the latest differential).
Incremental: copies changes since the last backup of any type (fastest backup, slowest restore, because every incremental since the last full must be applied in order).

DR Testing
Tabletop exercise: discussion-based walkthrough.
Simulation: test procedures without actual failover.
Full failover test: actually switch to the DR site to verify everything works.
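The restore-speed difference between the backup types comes down to how many backup sets must be applied. The sketch below works out the restore chain for a schedule, following the definitions of full, differential, and incremental given above. The function name and the weekday labels are hypothetical.

```python
def restore_chain(backups):
    """List the backup sets needed to restore to the most recent backup.

    backups: chronological list of (label, kind) pairs, where kind is
    "full", "differential", or "incremental".
    """
    full_indexes = [i for i, (_, kind) in enumerate(backups) if kind == "full"]
    if not full_indexes:
        raise ValueError("no full backup to restore from")
    last_full = full_indexes[-1]
    full_label = backups[last_full][0]
    chain = [full_label]
    for label, kind in backups[last_full + 1:]:
        if kind == "differential":
            # A differential holds every change since the full backup,
            # so it replaces anything applied after the full so far.
            chain = [full_label, label]
        elif kind == "incremental":
            # An incremental only holds changes since the previous backup.
            chain.append(label)
    return chain


if __name__ == "__main__":
    incremental_week = [("Sun", "full"), ("Mon", "incremental"),
                        ("Tue", "incremental"), ("Wed", "incremental")]
    differential_week = [("Sun", "full"), ("Mon", "differential"),
                         ("Tue", "differential"), ("Wed", "differential")]
    print(restore_chain(incremental_week))
    print(restore_chain(differential_week))
```

With incrementals the chain grows by one set per day, while with differentials it never exceeds two sets. The price is that each differential backup is larger than the one before it.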

High Availability

High availability (HA) eliminates single points of failure so that services remain operational even when components fail. The goal is to maximize uptime through redundancy at every layer.

Active-Active: Both systems handle traffic simultaneously. If one fails, the other absorbs the full load. Better resource utilization but more complex configuration.
Active-Passive: One system handles traffic while the standby monitors. On failure, the passive system takes over. Simpler but the standby is idle during normal operations.

/* High Availability Components */
NIC Teaming / Link Aggregation
  Combine multiple NICs into one logical link
  Provides redundancy and increased bandwidth (LACP)
Clustering
  Multiple servers act as one logical system
  If one node fails, others continue serving requests
Load Balancing
  Distributes traffic across multiple servers
  Algorithms: round-robin, least connections, weighted
Power Redundancy
  Dual power supplies (PSU) in each device
  UPS (Uninterruptible Power Supply) for short outages
  Generators for extended power failures
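Two of the load-balancing algorithms listed above can be sketched in a few lines. This is an illustration of the selection logic only, not a production load balancer. The server names and connection counts are made up.

```python
import itertools


def round_robin(servers):
    """Yield servers in a fixed rotation, one per incoming request."""
    return itertools.cycle(servers)


def least_connections(active_connections):
    """Pick the server currently handling the fewest connections.

    active_connections: dict mapping server name to open connection count.
    """
    return min(active_connections, key=active_connections.get)


if __name__ == "__main__":
    rotation = round_robin(["web1", "web2", "web3"])
    print([next(rotation) for _ in range(4)])
    print(least_connections({"web1": 5, "web2": 2, "web3": 7}))
```

Round-robin ignores server load, which works when requests cost about the same. Least connections adapts when some sessions are long-lived.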

Performance Monitoring

Effective network operations require continuous monitoring of key performance metrics. Understanding these metrics helps identify issues before they impact users and validates that the network meets its SLAs.

| Metric | What It Measures | Why It Matters |
|---|---|---|
| Latency | Round-trip time (measured by ping) | High latency degrades user experience |
| Jitter | Variation in latency over time | Critical for VoIP and video -- causes choppy audio |
| Bandwidth | Maximum theoretical capacity | The size of the pipe (e.g., a 1 Gbps link) |
| Throughput | Actual data transfer rate achieved | Always less than bandwidth due to overhead |
| Packet Loss | Percentage of dropped packets | Even 1-2% causes noticeable degradation |

/* QoS (Quality of Service) */
DSCP Marking      Tags packets with priority levels
Priority Queuing  High-priority traffic (VoIP) sent first
Traffic Shaping   Smooths out bursty traffic to match link capacity
Congestion Mgmt   Decides which packets to drop when overloaded

/* Anomaly Alerting */
Set thresholds based on baseline documentation
Alert when metrics deviate from normal ranges
Examples: CPU > 90%, link utilization > 80%, packet loss > 1%
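The metrics and threshold alerting described above can be tied together in a short sketch. It assumes a list of ping round-trip times in milliseconds, with `None` marking a lost probe. Jitter is computed here as the mean absolute difference between consecutive replies, a simplification rather than the smoothed estimator used by RTP. Function names and threshold values are illustrative.

```python
from statistics import mean


def summarize_pings(rtts_ms):
    """Compute latency, jitter, and loss from RTT samples (None = lost)."""
    received = [rtt for rtt in rtts_ms if rtt is not None]
    if not received:
        raise ValueError("need at least one reply to compute latency")
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "latency_ms": mean(received),
        "jitter_ms": mean(deltas) if deltas else 0.0,
        "loss_pct": 100 * (len(rtts_ms) - len(received)) / len(rtts_ms),
    }


def breached(metrics, thresholds):
    """Return the names of metrics that exceed their threshold."""
    return sorted(
        name for name, limit in thresholds.items() if metrics[name] > limit
    )


if __name__ == "__main__":
    metrics = summarize_pings([20, 22, None, 30, 28])
    # Hypothetical thresholds; real ones should come from the baseline.
    print(metrics)
    print(breached(metrics, {"latency_ms": 100, "jitter_ms": 30, "loss_pct": 1}))
```

In practice the thresholds would be derived from the documented baseline rather than hard-coded, so that an alert reflects a deviation from this network's normal behavior.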

Key Takeaways