Switched Cluster Topologies: Fat-Tree, Leaf-Spine, Dragonfly+, Tesseract

Every cluster diagram in a vendor deck starts the same way: a row of boxes labelled "node," a row of boxes labelled "switch," and arrows between them. The diagrams are deliberately simple because the real choice underneath — which topology, with how much oversubscription, with what per-port speed — is the single biggest cost decision in an AI cluster after the GPUs themselves.

This article is the topology layer between N02 (which protocol — InfiniBand, RoCE, plain Ethernet) and N06–N08 (how the wire actually behaves once the topology is wired up). It covers the four families that matter in 2026: fat-tree / Clos / leaf-spine, dragonfly / dragonfly+, tesseract / hypercube, and the torus family that survived in two specific corners of HPC. It ends with the honest call: roughly nine out of ten Kentino customers do not need any of this, and the article exists for the tenth.

Audience: people sizing an 8-to-64 node training cluster and the network around it. Not a Cisco/NVIDIA configuration cookbook — the mental model that makes the cookbook readable.

Three different things people call "bandwidth"

Before we draw boxes and arrows, the vocabulary. Three terms get used interchangeably in cluster sales literature and they are not the same number:

Term	What it actually measures	Where it bites you
Aggregate bandwidth	Sum of all link capacities in the fabric. The number on the vendor data sheet.	Useless on its own. A 1 TB/s aggregate fabric can still be a bottleneck for one flow.
Cross-sectional bandwidth	Throughput across an arbitrary cut through the fabric.	Real workload throughput when traffic is non-uniform — what you measure during allreduce.
Bisection bandwidth	Cross-sectional bandwidth across the worst cut that divides nodes into two equal halves.	The number that determines whether allreduce hits line rate at scale.

A 32-port 400 GbE switch has 12.8 Tb/s of aggregate bandwidth. Put 16 nodes on it at 400 GbE each, and you have 3.2 Tb/s of bisection bandwidth (8 nodes × 400 Gb/s on each side of the cut). For an allreduce step where each of the 16 nodes sends half its gradient across the bisection, that 3.2 Tb/s is what the step time is divided by, not 12.8.

The shortcut: bisection bandwidth is the only one of these three numbers that predicts training step time on an allreduce-bound workload. When a vendor brochure quotes aggregate, find the worst cut that splits the nodes into two equal halves and count only the link capacity that crosses it. That gets you back to bisection.
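The single-switch arithmetic above can be sketched in a few lines. This is a minimal illustration, not a sizing tool: the function names are hypothetical, and it assumes a non-blocking switch with one NIC per node, so the node links are the only limit on the cut.

```python
def aggregate_bw_gbps(ports: int, port_speed_gbps: int) -> int:
    """Sum of all port capacities: the data-sheet number."""
    return ports * port_speed_gbps


def single_switch_bisection_gbps(nodes: int, nic_speed_gbps: int) -> int:
    """Capacity crossing a cut that puts half the nodes on each side.

    Assumption: the switch itself is non-blocking, so only the node
    NICs on one side of the cut limit the traffic crossing it.
    """
    return (nodes // 2) * nic_speed_gbps


if __name__ == "__main__":
    # 32-port 400 GbE switch, 16 nodes attached at 400 GbE each.
    print(aggregate_bw_gbps(32, 400))             # 12800 Gb/s = 12.8 Tb/s
    print(single_switch_bisection_gbps(16, 400))  # 3200 Gb/s = 3.2 Tb/s
```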

For a 16-node cluster with 8 GPUs each (128 GPUs total) on 100 GbE single-NIC nodes:

Topology	Aggregate BW	Cross-sectional BW (avg)	Bisection BW
Single 32-port 100 GbE switch	1.6 Tb/s	800 Gb/s	800 Gb/s
Fat-tree, 1:1 (full bisection)	3.2 Tb/s	1.6 Tb/s	1.6 Tb/s
Fat-tree, 2:1 oversubscribed	2.4 Tb/s	800 Gb/s	800 Gb/s
Dragonfly+ (4 groups of 4)	2.0 Tb/s	~1.0 Tb/s	~800 Gb/s (worst pair)
4D tesseract (switchless)	1.6 Tb/s	~800 Gb/s	800 Gb/s
3D torus 4×2×2	1.5 Tb/s	~600 Gb/s	600 Gb/s

Same node count, same wire speed, different numbers depending on what you mean. This is the framing the rest of the article uses.

Fat-tree, Clos, leaf-spine — the same thing in three accents

Charles Clos proved in 1953 that a multi-stage network of small crossbar switches can be non-blocking — any input can reach any output without contention — at a fraction of the cost of one giant crossbar. Every modern datacenter network is some variant of this idea. The naming has become tangled:

A Clos network is the mathematical structure: ingress, middle, and egress stages of smaller switches.
A fat-tree (Charles Leiserson, 1985) is a Clos variant where the trunks closer to the root get progressively fatter so bisection bandwidth scales with N.
A folded Clos wraps the egress stage back onto the ingress stage. A leaf-spine is a two-tier folded Clos. A three-tier folded Clos with leaf, spine, and super-spine is what most people call a fat-tree in practice.

Two-tier leaf-spine: every leaf connects to every spine. Any two nodes communicate in at most two hops. Full bisection = no oversubscription at the spine layer.

Every leaf connects to every spine. Any-to-any traffic is at most leaf → spine → leaf, two hops. With enough spine bandwidth, the fabric is non-blocking: every node can simultaneously talk to every other node at line rate.

The oversubscription ratio is the knob that decides cost. If each leaf has 32 downlinks of 100 GbE (3.2 Tb/s into the rack) and 8 uplinks of 100 GbE (800 Gb/s out of the rack), the oversubscription is 4:1 — four times more bandwidth into the rack than out of it. Full bisection means 1:1: as much uplink as downlink. 2:1 is common in general-purpose datacenters. 1:1 (full bisection) is the AI-cluster baseline.
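The ratio is simple enough to compute by hand, but a small sketch makes the knob explicit. The helper names below are hypothetical, and the sketch assumes every downlink and every uplink on a leaf runs at a single speed each.

```python
from fractions import Fraction


def oversubscription(downlinks: int, down_gbps: int,
                     uplinks: int, up_gbps: int) -> Fraction:
    """Downlink capacity divided by uplink capacity for one leaf."""
    return Fraction(downlinks * down_gbps, uplinks * up_gbps)


def uplinks_needed(downlinks: int, down_gbps: int,
                   up_gbps: int, target_ratio: int = 1) -> int:
    """Smallest uplink count that keeps the ratio at or below target_ratio."""
    needed_capacity = downlinks * down_gbps
    per_uplink = target_ratio * up_gbps
    return -(-needed_capacity // per_uplink)  # integer ceiling division


if __name__ == "__main__":
    # The leaf from the text: 32 x 100 GbE down, 8 x 100 GbE up.
    print(oversubscription(32, 100, 8, 100))  # 4, i.e. 4:1
    # Uplinks required for full bisection (1:1) at 100 GbE and at 400 GbE.
    print(uplinks_needed(32, 100, 100))       # 32
    print(uplinks_needed(32, 100, 400))       # 8
```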

Configuration	Leaf uplinks	Spine count	Approx switch+optics cost (2026)	Bisection BW
Single 64-port 400 GbE switch (one rack)	n/a	1	~$50k	12.8 Tb/s (one rack)
2-tier leaf-spine, 4:1 oversubscribed	8× 100 GbE	2× 32-port	~$120k	800 Gb/s
2-tier leaf-spine, 2:1 oversubscribed	16× 100 GbE	4× 32-port	~$180k	1.6 Tb/s
2-tier leaf-spine, full bisection (1:1)	32× 100 GbE	8× 32-port	~$280k	3.2 Tb/s
2-tier, 400 GbE uplinks, full bisection	8× 400 GbE	4× 32-port	~$220k	3.2 Tb/s, fewer cables

The cost roughly doubles going from 4:1 to 1:1 even though you are buying four times as many uplinks, spine ports, and optics, because the leaf switches and node-facing ports stay the same. The reason every serious AI cluster pays this premium: oversubscription destroys allreduce throughput. Synchronized 8-flow allreduce on a 4:1 oversubscribed fabric does not run at one-quarter speed: it collapses under PFC backpressure (N07) and can lose 60–80% of theoretical throughput in practice. The math says "divide by 4." Reality says "divide by 5–10."

NVIDIA's DGX SuperPOD reference architecture specifies a three-tier fat-tree with full bisection on Quantum-2 NDR InfiniBand at 400 Gb/s per port. Meta's published RoCE training clusters build the same shape on Ethernet, and Microsoft's Azure ND-series builds it on InfiniBand. The industry has converged on full-bisection fat-tree for AI training, and the 2024–2026 evolution is making the fat-tree wider (400 GbE → 800 GbE per port) or rail-optimized (next section), not changing the fundamental topology.

Rail-optimized fat-tree — the AI-specific dialect

The standard fat-tree treats every NIC the same. AI training cares about which GPU's NIC sends which gradient, because allreduce traffic patterns are not uniform. The rail-optimized variant assigns each GPU in a node to a specific "rail" — a dedicated leaf-spine path — and ensures that the i-th GPU on every node talks only to the i-th GPU on every other node through the i-th rail.

Rail-optimized fat-tree: each GPU slot maps to a dedicated independent spine plane. Allreduce ring on GPU 3 uses only Rail 3.

Eight independent two-tier fat-trees, one per GPU slot. Allreduce ring on GPU 3 across 16 nodes uses only Rail 3, never crosses into other rails. Benefits: zero ECMP collisions between rails, simpler routing, lower switch radix per plane. Trade-off: a job that spans GPU slots (tensor-parallel inside a node, data-parallel across nodes) gets split across rails by NCCL anyway, so the topology only helps if the workload aligns. For data-parallel and rail-aware NCCL it is a clear win; for tensor-parallel spanning rails the saving evaporates.
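The alignment condition can be stated as a tiny model. This is a sketch under the assumption described above (GPU slot i on every node is cabled to rail i); the Endpoint type and function names are hypothetical and do not correspond to any NCCL API.

```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    node: int
    gpu_slot: int


def rail_of(endpoint: Endpoint) -> int:
    """Assumed wiring: GPU slot i on every node attaches to rail i."""
    return endpoint.gpu_slot


def ring_for_slot(gpu_slot: int, num_nodes: int) -> list[Endpoint]:
    """Data-parallel ring linking the same GPU slot across all nodes."""
    return [Endpoint(node, gpu_slot) for node in range(num_nodes)]


def rails_used(endpoints: list[Endpoint]) -> set[int]:
    """Set of rails a group of endpoints touches."""
    return {rail_of(ep) for ep in endpoints}


if __name__ == "__main__":
    # Aligned case: slot 3 across 16 nodes stays on one rail.
    print(rails_used(ring_for_slot(3, 16)))                    # {3}
    # Misaligned case: a group spanning all 8 slots of one node.
    print(len(rails_used([Endpoint(0, s) for s in range(8)])))  # 8
```

A group that touches a single rail gets the benefit; a group that touches all eight behaves like traffic on an ordinary fat-tree.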

Dragonfly and Dragonfly+ — when you cannot afford fat-tree

The fat-tree's cost grows roughly as N log N — every doubling of node count needs more spine bandwidth, and the third tier doubles the switch count per endpoint. For 1024 nodes a non-blocking three-tier fat-tree is buildable. For 10,000 nodes, the switch count and optics cost get punishing. Dragonfly, proposed by John Kim, William Dally et al. in 2008, was designed specifically to scale past that wall.

The idea: cluster nodes into groups. Inside a group, all switches are densely connected (often a smaller Clos). Between groups, every group has one direct link to every other group. The result is a network with diameter 3 (group-local hop, inter-group hop, group-local hop) that scales to enormous node counts with far fewer long-haul cables than fat-tree.

Dragonfly: dense intra-group Clos, one global link per group pair. Diameter 3. Scales to 1000+ nodes with fewer long-haul cables than fat-tree.

The big saving is optical cabling. Long-haul optics between racks are the most expensive part of a fat-tree. Dragonfly replaces them with one fat link per group pair, not one per leaf-spine combination. For a cluster with G groups of S nodes each, fat-tree needs roughly G × S × log(G × S) cables; dragonfly needs G(G − 1)/2 inter-group cables plus the per-group fabric. At G = 32 groups of 32 nodes (1024 total), the long-haul cable count drops by roughly an order of magnitude.
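Plugging the two rough formulas into code shows the scale of the gap. This is a back-of-envelope sketch only: the function names are hypothetical, the logarithm base is assumed to be 2 (the text does not specify one), and the dragonfly figure counts inter-group cables alone, leaving out the per-group fabric.

```python
import math


def fat_tree_cables_estimate(groups: int, nodes_per_group: int) -> float:
    """Rough N * log2(N) estimate from the text, with N = G * S."""
    n = groups * nodes_per_group
    return n * math.log2(n)


def dragonfly_global_cables(groups: int) -> int:
    """One inter-group cable per pair of groups: G(G - 1) / 2."""
    return groups * (groups - 1) // 2


if __name__ == "__main__":
    g, s = 32, 32  # 1024 nodes total
    fat_tree = fat_tree_cables_estimate(g, s)
    dragonfly = dragonfly_global_cables(g)
    print(fat_tree)    # 10240.0
    print(dragonfly)   # 496
    print(fat_tree / dragonfly > 10)  # True
```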

Dragonfly+ (Mellanox, 2017) refines this for InfiniBand. The intra-group fabric becomes a small bipartite Clos so group expansion does not require re-wiring, and inter-group links use adaptive routing to dodge congested groups. The original dragonfly, rather than Dragonfly+, is the topology in Frontier (ORNL, exascale AMD MI250X) and El Capitan (LLNL, MI300A): both are wired with the HPE Slingshot-11 interconnect in a dragonfly arrangement, with a three-hop maximum diameter and 12.8 Tb/s per switch.

The catch is the failure mode for small jobs that span groups. In a fat-tree, two nodes at opposite ends of the cluster see the same bisection bandwidth as two nodes one rack apart (modulo hop count). In a dragonfly, two nodes in different groups share their inter-group link with every other cross-group flow. If your 16-node training job lands on 8 nodes in group A and 8 in group B, you are sharing one inter-group link with everyone else who happens to span the same pair. Adaptive routing helps; it does not eliminate the contention.

Practical implication: dragonfly works beautifully at hyperscaler problem sizes (1000+ nodes, jobs sized to fill groups) and not so well for medium clusters with diverse small jobs. It is the wrong topology for a 16-node training cluster — fat-tree is cheaper and faster at that scale. It is the right topology for a 1024-node mixed-workload supercomputer.

Tesseract — the 4D hypercube

A tesseract is a 4D hypercube: 16 vertices, each connected to exactly 4 neighbours, diameter 4 (longest shortest path between any two nodes). Generalize to k dimensions and you get a k-cube: 2^k nodes, each with k direct links, diameter k. Hamming-distance routing — XOR source and destination addresses, flip one bit at a time — is trivially deterministic and load-balanced under random traffic.
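The XOR routing rule is short enough to write out. The sketch below is one common way to do it (fix the differing address bits from lowest to highest, often called e-cube or dimension-order routing); the function names are hypothetical.

```python
def neighbours(node: int, dims: int = 4) -> list[int]:
    """Nodes whose address differs from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dims)]


def hypercube_route(src: int, dst: int, dims: int = 4) -> list[int]:
    """Node sequence from src to dst, flipping differing bits lowest-first."""
    size = 1 << dims
    if not (0 <= src < size and 0 <= dst < size):
        raise ValueError("address out of range for this hypercube")
    diff = src ^ dst          # set bits mark the dimensions still to cross
    path = [src]
    current = src
    for bit in range(dims):
        if diff & (1 << bit):
            current ^= 1 << bit   # one hop along this dimension
            path.append(current)
    return path


if __name__ == "__main__":
    print(neighbours(0b0000))               # [1, 2, 4, 8]
    print(hypercube_route(0b0000, 0b1111))  # [0, 1, 3, 7, 15]
    print(hypercube_route(0b0101, 0b0110))  # [5, 4, 6]
```

The hop count equals the Hamming distance between the two addresses, so the longest route in a tesseract is 4 hops, matching the diameter.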

Tesseract (4D hypercube): 16 nodes, each with 4 neighbours. Solid lines = 3D cube edges; dashed lines = 4th-dimension links. Diameter 4. Each node label is a 4-bit address; neighbours differ by exactly one bit.

Hypercube topologies dominated 1980s massively-parallel computing. The Connection Machine CM-2 (Thinking Machines, 1987) had 65,536 one-bit processors on 4,096 chips, with the chips wired as a 12-dimensional hypercube. Intel iPSC/2 ran 7D hypercubes. The CM-5 (Thinking Machines, 1991) abandoned hypercubes for fat-tree because the hypercube approach did not scale gracefully past about 1024 nodes: every new dimension doubles the node count and requires re-cabling every existing node.

In 2026 the term "tesseract" still pops up in three places worth distinguishing:

As a research / DiRAC HPC system name. The DiRAC Tesseract at EPCC (Edinburgh) is a 1476-node HPE SGI 8600 cluster on Intel Omni-Path. "Tesseract" is branding; the fabric is closer to fat-tree.
As an "SDN control-plane" research term (Tesseract: a 4D control plane, Yan et al.). Unrelated to physical topology.
As the underlying topology of compact switchless accelerator clusters. A 16-node cluster wired as a literal 4D hypercube has interesting properties: every node has exactly 4 NICs, no central switch, deterministic routing, diameter 4. We cover this properly in N05 (switchless topologies).

What a tesseract offers in 2026: no switch tax, deterministic routing via Hamming-distance XOR, and low diameter (log₂(N)). What makes it hard: fixed N (must be a power of 2), cabling complexity grows with dimension, per-node NIC count equals k, and modern AI collectives (NCCL ring/tree) do not exploit the hypercube structure natively.

Torus — the survivor in two specific corners

The k-ary n-cube generalizes the hypercube: instead of a binary address with one link per dimension, use a k-by-k-by-k grid with wrap-around. A 3D torus has each node connected to 6 neighbours (±x, ±y, ±z). A 6D torus has 12 neighbours.
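The neighbour rule generalizes to any number of dimensions with modular arithmetic. A minimal sketch, with a hypothetical function name; it deduplicates neighbours because in a dimension of size 2 the +1 and -1 steps land on the same node.

```python
def torus_neighbours(coord: tuple[int, ...],
                     shape: tuple[int, ...]) -> list[tuple[int, ...]]:
    """Distinct neighbours of `coord` in a torus with the given shape."""
    found = set()
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            candidate = list(coord)
            candidate[axis] = (coord[axis] + step) % size  # wrap-around
            candidate_t = tuple(candidate)
            if candidate_t != coord:  # a size-1 dimension adds no neighbour
                found.add(candidate_t)
    return sorted(found)


if __name__ == "__main__":
    print(len(torus_neighbours((0, 0, 0), (4, 4, 4))))  # 6
    print(len(torus_neighbours((0,) * 6, (3,) * 6)))    # 12
    # In a 4x2x2 torus the size-2 dimensions contribute one neighbour each.
    print(torus_neighbours((0, 0, 0), (4, 2, 2)))
```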

IBM Blue Gene/L and /P ran on a 3D torus, scaling to hundreds of thousands of nodes with each node having only 6 high-speed links. Fujitsu Tofu (the K computer's interconnect, 2011) generalized this to a 6D mesh/torus — 158,976 nodes on Fugaku (active through 2026), arranged 24×23×24×2×3×2.

Cerebras Wafer-Scale Engine uses a 2D mesh on-wafer, the torus family's sibling without wrap-around links: every processing element talks to its 4 nearest neighbours at ~1 ns per hop. That works because on-wafer wires are nearly free; off-wafer cables would not be.

Why torus lost everywhere else: asymmetric paths and bad behaviour for non-uniform AI workloads. Modern AI collectives (NCCL ring/tree, NVIDIA's hierarchical algorithms) assume bandwidth-uniform any-to-any connectivity, and a torus violates that. In 2026 torus survives in two places: Cerebras's on-wafer interconnect, and Fujitsu's Fugaku and its successors. Outside those niches, nearly every new AI cluster in 2025–2026 is Clos.

Comparison table

Topology	Diameter	Bisection BW (16 nodes, 100 GbE)	Switches required	Cables (approx)	Cost ratio vs fat-tree 1:1	Growth model
Single switch	1	800 Gb/s (switch-limited)	1× 32-port	16	0.3×	Hard cap at switch radix
Fat-tree 1:1 (full bisection)	2	1.6 Tb/s	2 spine + 2 leaf	64	1.0×	Add leaves / spines
Fat-tree 2:1	2	800 Gb/s	2 spine + 2 leaf	48	0.7×	Add leaves
Dragonfly+	3	800 Gb/s (group-pair limited)	4 (2 per group)	32–40	0.6× at 16N; flips above 64N	Add groups
4D tesseract (switchless)	4	~800 Gb/s (effective)	0	32	0.4×	Doubles by adding a dim
3D torus (4×2×2, switchless)	4	~600 Gb/s	0	48	0.5×	Any rectangular size

Cluster uplink — how the topology meets the outside world

A switched fabric is an island. It has to connect to the corporate network (model registries, dataset storage, S3, telemetry), to developer workstations (SSH, Jupyter, copy-out of checkpoints), and to other clusters (training → inference handoff). That connection is the cluster uplink.

Two models, with very different consequences:

Single uplink point. A pair of spine switches (or a dedicated uplink router) terminates all external connectivity. Simple to firewall, easy to rate-limit, simple to monitor. Failure mode: that link is a single point of failure; saturating it (a big checkpoint copy out, a 10 GB dataset shard pull) impacts every node simultaneously.

Distributed uplink. Each leaf has a separate uplink to the campus network, often a slower 25 GbE link alongside the 100 GbE fabric. Dataset pulls and external traffic stay local to the leaf, so they cause no congestion on the internal fabric. Failure mode: every leaf is a security boundary, firewalling is N times more work, monitoring is harder.

For the Kentino base case (4–16 node training cluster), the single uplink point is the right answer. The internal fabric is RDMA-only (RoCE or InfiniBand), tuned for low latency and lossless behaviour. The uplink is plain Ethernet, TCP, normal QoS. Do not put the dataset object store on the same lossless fabric as the GPU allreduce — a misbehaving S3 client should not be able to trigger PFC backpressure on training traffic. Two fabrics: data plane (lossless RDMA) and management/uplink plane (lossy TCP). N08 covers the practical setup.

The Kentino honest take

Most Kentino customers buy 1 to 4 nodes. At that scale:

1 node. No topology question. PCIe inside the box (K07), one 25 GbE management NIC out, done.
2 nodes. A direct cable between two RDMA NICs. No switch. No topology to choose.
3–4 nodes. A single 32-port 100 GbE switch handles every-to-every with full bisection at $30k–50k all-in. Still no topology to choose.

The topology conversation starts at 8 nodes, when one switch's port count gets tight, and becomes mandatory at 16 nodes. Below that, the right answer is "one good switch, full bisection on every port, get on with your life." Above that, the right answer is "two-tier leaf-spine, 100 or 200 GbE per node, full bisection (1:1), and never touch the oversubscription knob unless someone forces you."

Dragonfly+ is the right answer at hyperscaler problem sizes. Tesseract / hypercube is interesting as a switchless option for compact clusters (N05). Torus is a vendor-locked choice for HPC operators with topology-aware workloads. For everyone else in the Kentino price band, fat-tree is the default. Full bisection if you can afford it; 2:1 if you cannot; never 4:1 for AI training.

What to do next

If you are sizing a switched fabric for a real cluster:

Write down node count, GPUs per node, line rate per NIC (assuming one NIC per GPU; otherwise use NICs per node). Multiply. Divide by 2. That is your target bisection number.
Decide if your jobs span the whole cluster or sit within one rack. Rack-local jobs tolerate oversubscription. Cluster-spanning jobs do not.
Run nccl-tests/all_reduce_perf on a temporary fat-tree config before committing to the cable run. If 8-node allreduce already loses 20% of theoretical busbw, you have a different problem than topology.
Don't optimize for the next 5 years. Buy for the cluster you need this year with a clear expansion path. Fat-tree leaf-spine is the cheapest topology to grow incrementally.
Match the uplink to your dataset ingest rate, not to the internal fabric speed. Most clusters need 25–100 GbE outbound, not 400.
Two fabrics, always. Data plane and management plane separate, even at 4 nodes.
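The first and third steps in this list reduce to arithmetic that is worth keeping in a script next to the benchmark logs. This is a sketch with hypothetical helper names. The bus-bandwidth factor 2(n − 1)/n is the one nccl-tests documents for allreduce; the "theoretical" figure you compare against is your own input, and both bandwidths must use the same unit.

```python
def target_bisection_gbps(nodes: int, nics_per_node: int,
                          nic_gbps: float) -> float:
    """Step 1: total NIC capacity divided by two."""
    return nodes * nics_per_node * nic_gbps / 2


def allreduce_busbw(algbw: float, ranks: int) -> float:
    """Convert allreduce algorithm bandwidth to bus bandwidth.

    Uses the 2 * (n - 1) / n correction factor that nccl-tests applies
    to allreduce results.
    """
    return algbw * 2 * (ranks - 1) / ranks


def busbw_shortfall(measured: float, theoretical: float) -> float:
    """Fraction of theoretical bus bandwidth that is missing."""
    return 1 - measured / theoretical


if __name__ == "__main__":
    # 16 nodes, 8 NICs per node (one per GPU), 100 Gb/s per NIC.
    print(target_bisection_gbps(16, 8, 100))  # 6400.0 Gb/s = 6.4 Tb/s
    print(allreduce_busbw(10, 8))             # 17.5
    # A result at 40 against a theoretical 50 is about a 20% shortfall.
    print(round(busbw_shortfall(40, 50), 2))  # 0.2
```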

The follow-ups in this track go deeper: N05 covers switchless topologies (the tesseract and torus options when you genuinely want no switch); N06 dissects where every microsecond of latency comes from once the fabric is up; N07 covers the routing and congestion-control work that decides whether your beautiful topology actually performs; N08 is the hands-on RDMA setup and cluster-uplink design.

This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.
