OPEN PLATFORM / HPC NETWORKING / COMMUNITY REVIEWED

Master the fabric that runs AI.

InfiniBand. RoCEv2. RDMA. DGX SuperPOD. Spectrum-X. Learn AI and HPC networking through interactive chapters, stateful labs, and a simulator built for network engineers. Sign in once, then keep your progress and discussion history attached to one account.

FabricLab CLI

scroll to explore

THE GAP

There is no Packet Tracer for HPC networking.

Network engineers who can troubleshoot BGP, reason about ECMP, and design VXLAN fabrics still walk into AI data centers and find an unfamiliar world. The knowledge is fragmented across vendor docs, conference talks, and incident writeups.

FabricLab turns that scattered knowledge into a structured, reviewable, interactive platform. Chapters explain the hardware and protocols. Labs let you test commands against live state. The community can then sharpen both.

400G

per GPU rail in a modern DGX training fabric

0

packet drops tolerated in healthy RDMA training flows

17

published chapters live in the open catalog today

CURRICULUM

A structured path from hardware to protocol.

17 chapters. 12 scenario labs. One simulator. Browse the catalog in public, then sign in to open the learning surfaces.

Foundations

Chapter 0

The Hardware Story

Physical layer orientation. What an HCA is, why NICs became DPUs, how a DGX node is wired, the three separate networks.

Available
Foundations

Chapter 1

Operating Systems and Management Platforms

What runs on every device. How you access it after power-on. The management philosophy. CLI vs orchestrated. First power-on sequence.

Available
Foundations

Chapter 2

Why HPC Networking Is Different

The AllReduce barrier, why TCP fails, tail latency math, and the mental model shift from enterprise to AI networking.

Available
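The tail-latency math this chapter covers can be sketched in a few lines: a synchronous AllReduce step finishes only when the slowest of N workers is done, so even a rare per-worker slowdown becomes near-certain at scale. The probabilities below are illustrative assumptions, not numbers from the chapter.

```python
# A synchronous AllReduce step is gated by the *slowest* worker,
# so per-worker slowdown probability compounds across N workers.
# p_slow = 0.001 is an illustrative assumption.

def straggler_probability(n_workers: int, p_slow: float) -> float:
    """Probability that at least one worker is slow in a given step."""
    return 1.0 - (1.0 - p_slow) ** n_workers

for n in (8, 64, 512, 4096):
    print(f"{n:>5} workers: P(step delayed) = {straggler_probability(n, 0.001):.3f}")
```

At 4096 workers, a one-in-a-thousand hiccup per worker delays nearly every iteration; this is the mental-model shift from enterprise averages to AI-fabric tails.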
Operations

Chapter 3

The CLI - Reading the Fabric

The commands and discipline for reading HPC fabric state. Which commands run where, how to read their output, and the investigation workflow from physical layer to configuration.

Available
Operations

Chapter 4

InfiniBand Operations - ONYX CLI and Fabric Management

The InfiniBand operations layer: ONYX CLI, error counter interpretation, Subnet Manager management, ibdiagnet fabric sweep, and UFM event correlation.

Available
Operations

Chapter 5

PFC, ECN, and Congestion Control

How losslessness actually works: PFC mechanics at the wire level, pause storm formation, ECN CE bit marking, DCQCN rate control algorithm, and the complete RoCEv2 port configuration checklist.

Available
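The DCQCN rate control the chapter walks through can be sketched from the shape of the published algorithm: on each CNP the sender cuts its rate multiplicatively by half its congestion estimate, and in quiet periods it decays that estimate and recovers toward the pre-cut rate. This is a hedged sketch, not FabricLab's simulator; the gain `G` and the 400G line rate are illustrative assumptions.

```python
# Minimal DCQCN sender-side sketch (reaction point only).
# G and the line rate are illustrative assumptions.
G = 1 / 256          # EWMA gain for the congestion estimate alpha

class DcqcnSender:
    def __init__(self, line_rate_gbps: float):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate for recovery
        self.alpha = 1.0           # congestion-severity estimate

    def on_cnp(self):
        """A CNP arrived (receiver saw ECN CE marks): cut rate multiplicatively."""
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.alpha = (1 - G) * self.alpha + G

    def on_quiet_timer(self):
        """No CNP in the last window: decay alpha, recover toward target."""
        self.alpha = (1 - G) * self.alpha
        self.rc = (self.rc + self.rt) / 2   # fast-recovery step

s = DcqcnSender(line_rate_gbps=400.0)
s.on_cnp()
print(f"rate after first CNP: {s.rc:.0f} Gbps")   # halved while alpha is 1.0
```

The point the chapter makes is visible even in this sketch: one CNP halves a 400G flow, so ECN marking thresholds and PFC headroom have to be tuned together.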
Operations

Chapter 6

Efficient Load Balancing

Why AI traffic is structurally low-entropy and how that breaks ECMP. The four load balancing modes (SLB, DLB, GLB, sDLB). Per-packet spraying and RSHP. Incast congestion patterns and how to diagnose them from spine utilisation counters.

Available
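The low-entropy problem this chapter explains can be demonstrated with a toy hash: per-flow ECMP only balances well when there are many flows, and AI training often produces a handful of long-lived elephant flows. The hash function and addresses below are illustrative, not any vendor's actual ECMP hash; 4791 is the RoCEv2 UDP port.

```python
# Toy ECMP: hash a 5-tuple onto n_links. With few flows, some links
# saturate while others sit idle; with many flows the spread evens out.
import hashlib
from collections import Counter

def ecmp_link(five_tuple: tuple, n_links: int) -> int:
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links

def spread(n_flows: int, n_links: int) -> Counter:
    # Each "flow" is one GPU pair: fixed ports, only host addresses differ.
    # Address strings are synthetic labels, only used as hash input.
    flows = [(f"src-{i}", f"dst-{i}", 4791, 4791, "UDP") for i in range(n_flows)]
    return Counter(ecmp_link(f, n_links) for f in flows)

print("8 flows on 8 links:   ", sorted(spread(8, 8).values(), reverse=True))
print("8000 flows on 8 links:", sorted(spread(8000, 8).values(), reverse=True))
```

With eight flows on eight links, collisions are likely and some links carry double load while others carry none; that is the skew the spine utilisation counters reveal.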
Operations

Chapter 7

Topology Design

How AI fabric scales from one switch to a SuperPOD. Fat-tree topology math, bisection bandwidth, BasePOD vs SuperPOD reference designs, oversubscription calculations, ROD vs RUD wiring, switch buffer selection, and cabling constraints.

Available
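The oversubscription and bisection-bandwidth arithmetic this chapter teaches is back-of-the-envelope math. The port counts and speeds below are illustrative assumptions, not a specific reference design.

```python
# Fat-tree leaf math: oversubscription ratio and resulting bisection bandwidth.
# Illustrative numbers, not a BasePOD/SuperPOD reference design.

def oversubscription(downlinks: int, uplinks: int) -> float:
    """Host-facing vs fabric-facing bandwidth on a leaf (same-speed ports)."""
    return downlinks / uplinks

def bisection_bw_tbps(n_hosts: int, per_host_gbps: float, ratio: float) -> float:
    """Usable bisection bandwidth after leaf oversubscription, in Tbps."""
    return n_hosts * per_host_gbps / ratio / 2 / 1000

# A 64-port leaf split 32 down / 32 up is non-blocking (1:1);
# split 48 down / 16 up it is 3:1 oversubscribed.
print(oversubscription(32, 32))              # -> 1.0
print(oversubscription(48, 16))              # -> 3.0
print(bisection_bw_tbps(1024, 400.0, 1.0))   # -> 204.8
```

Training fabrics are usually built non-blocking (1:1) precisely because AllReduce stresses the bisection; oversubscription that is fine for enterprise traffic shows up directly as busbw loss here.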
Operations

Chapter 8

NCCL - The Application Layer

How NCCL translates AllReduce into RDMA operations. Ring vs Tree vs Double-Binary Tree algorithms. The environment variables that determine whether NCCL finds RDMA or falls back to TCP. Reading nccl-tests output. Correlating busbw degradation to fabric diagnostics.

Available
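The busbw column this chapter teaches you to read is derived, not measured: nccl-tests scales the algorithm bandwidth (bytes moved per second) by a collective-specific factor, which for AllReduce is 2*(n-1)/n, reflecting what each rank must both send and receive in a ring. The example numbers below are illustrative assumptions.

```python
# Reproduce the busbw calculation nccl-tests performs for AllReduce.
# Example size/time/rank values are illustrative.

def allreduce_busbw_gbps(size_bytes: int, time_s: float, n_ranks: int) -> float:
    algbw = size_bytes / time_s / 1e9          # algorithm bandwidth, GB/s
    return algbw * 2 * (n_ranks - 1) / n_ranks # AllReduce bus-bandwidth factor

# A 1 GiB AllReduce across 8 ranks completing in 10 ms:
print(f"{allreduce_busbw_gbps(2**30, 0.010, 8):.1f} GB/s")
```

Because busbw normalises away the collective's algorithmic overhead, a degraded busbw number points at the fabric rather than at NCCL's algorithm choice, which is what makes it correlatable with fabric diagnostics.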
Infrastructure

Chapter 9

Optics, Cabling, and the Physical Layer

The physical layer beneath the fabric: 400G/800G optics, DSPs, fiber types, form factors, cable selection, and why signal integrity and power density now shape AI cluster design.

Available
Infrastructure

Chapter 10

The Storage Fabric

The separate network that feeds and protects training: storage isolation, GDS data paths, NVMe-oF transports, parallel file systems, checkpoint economics, and storage topology choices.

Available
Infrastructure

Chapter 11

Monitoring, Telemetry, and Observability

Know about problems before the ML engineer's Slack message arrives. UFM REST API, DCGM GPU metrics, Prometheus alert design, threshold calibration, and cross-layer correlation across four monitoring streams.

Available
Architecture

Chapter 12

Scale-Up Networking - NVLink Switch System

External NVLink Switch modules, 57.6 TB/s all-to-all at 256 GPUs, NVLink Network addressing, scale-up vs scale-out architecture decisions, and NVLink Switch diagnostics.

Available
Architecture

Chapter 13

Alternative Topologies

Torus, folded torus, dragonfly, and TPU Pod design choices - where they came from, what workloads they suit, and why fat-tree remains dominant for AI training clusters.

Available
Infrastructure

Chapter 14

GPU Hardware Generations

Network-relevant implications of GPU generations: NVLink/NVSwitch generation table, SXM vs PCIe form factors, GH200, H100 CNX, and Confidential Computing.

Available
Architecture

Chapter 15

IP Routing for AI/ML Fabrics

How modern AI fabrics use routed Ethernet: BGP unnumbered, ASN design, BGP DPF, RIFT comparisons, Flex Algo, SRv6 path steering, and multi-tenant EVPN-VXLAN design.

Available
Architecture

Chapter 16

The GPU Compute Network - Packet Anatomy

A packet-level walkthrough from NCCL work queue entries to remote DMA completion: DGX interfaces, Queue Pair mechanics, ConnectX-7 processing, switch forwarding, and end-to-end packet decode.

Available

HOW IT WORKS

Three systems. One learning loop.

Structured chapters

Each chapter connects physical hardware, transport behavior, congestion control, and operator workflow into one narrative with visual support.

Stateful CLI labs

The simulator is not static text. Commands read live lab state, so outputs change as you diagnose faults, fix configuration, or recover links.

Community feedback loop

Readers can comment directly on chapters and labs, report technical glitches, and help tighten the curriculum without waiting for a closed release cycle.

WHO THIS IS FOR

Built by a network engineer, for network engineers.

CCNP / CCIE engineers

You can read BGP tables and design VXLAN fabrics. You have not spent time inside InfiniBand or RoCE yet. FabricLab is the transition path.

HPC cluster administrators

You manage the servers, but the fabric still feels opaque. FabricLab closes the gap between compute operations and network operations.

Cloud and platform architects

You are designing GPU infrastructure and need to understand what lossless fabrics demand at the protocol and operational level.

Network engineers growing into AI infrastructure

AI fabrics are a fast-moving specialisation. FabricLab gives you a structured path before the first production incident lands on your desk.

COMMUNITY

Keep the platform open. Make it sharper every week.

Comment where the issue appears

Leave technical corrections, lab glitches, and operator notes directly on the relevant chapter or lab page.

Contribute through the repo

The repository is set up for focused fixes, issue reports, new labs, and chapter improvements without turning the platform into a private product wall.

Support without restricting access

FabricLab stays openly accessible. Support links are optional and simply help fund more chapter review, better labs, and platform polish.

START LEARNING

Learn the fabric in the open.

Browse the catalog in public, then sign in to open chapters and labs, join tracked discussions, and keep your progress attached to one account.