Switchless Topologies: Mesh, Ring, and Direct-Connect for Small AI Clusters
A 32-port 400 GbE switch with the optics, breakouts, and software entitlements to actually use it lands somewhere between €40k and €80k in mid-2026, and a 64-port NDR InfiniBand switch is worse. For a customer building a two- to four-node training rig, the switch can cost more than the GPUs in one of the nodes. It also adds a hop of latency, a single point of failure, a separate firmware lifecycle, and a small project's worth of PFC/ECN tuning if you are running RoCE.
The fact almost nobody talks about is that below roughly eight nodes you do not need a switch at all. You can cable the NICs to each other directly and end up with something simpler, cheaper, and a little faster. The fact almost nobody admits is that above roughly eight nodes switchless falls off a cliff: the cabling, port count, and operational story stop being defensible. This article maps that range honestly.
It is the companion to N04 (switched topologies). Read N06 for the latency math the switchless win is built on, and K07 for what a single base node looks like — the building block this article connects.
The case for switchless
Four things you get for free by removing the switch:
- Zero switch latency. A modern cut-through Ethernet AI switch eats 400–600 ns per hop. An NDR InfiniBand switch is under 100 ns. A direct NIC-to-NIC cable adds wire delay (~5 ns/m on copper, the same on fibre) and nothing else. On a two-node ping-pong, this drops one-way latency from ~2 µs to ~1.2–1.5 µs.
- Zero switch cost. A two-node direct connect is two NICs and one DAC. A three-node triangle is three NICs and three DACs. The capex saving versus even a small 100 GbE switch with QSFP28 optics is real money on a small build — €10k–€30k that goes back to GPUs.
- No PFC/ECN headaches. A direct NIC-to-NIC link is point-to-point — flow control is a two-party conversation, PFC degenerates to "tell the peer to stop." There is no fabric-wide pause propagation pathology because there is no fabric.
- One device class to debug. When something breaks on a switchless fabric, the suspect list is two NICs, one cable, and the kernel drivers on both ends. That is a small, finite search space.
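The latency argument above is simple addition, and a minimal sketch makes the arithmetic explicit. The wire delay and per-hop switch figures are the approximate ones quoted in the list; the 1200 ns baseline for NIC, PCIe, and driver overhead and the cable lengths are assumptions chosen for illustration, not measurements.

```python
# Rough one-way latency budget: endpoint baseline + cable propagation + switch hops.
# All constants are approximate figures for illustration, not measurements.

WIRE_NS_PER_M = 5.0  # ~5 ns/m on copper or fibre


def one_way_latency_ns(base_ns, cable_m, switch_hops=0, switch_ns=500.0):
    """Estimate one-way latency in nanoseconds.

    base_ns     -- NIC + PCIe + driver cost with a zero-length cable (assumed)
    cable_m     -- total cable length along the path, in metres
    switch_hops -- number of switches traversed (0 for direct connect)
    switch_ns   -- per-hop switch latency (400-600 ns for cut-through Ethernet)
    """
    return base_ns + cable_m * WIRE_NS_PER_M + switch_hops * switch_ns


# Direct connect: one 3 m DAC. Switched: two 3 m cables and one switch hop.
direct = one_way_latency_ns(base_ns=1200.0, cable_m=3.0)
switched = one_way_latency_ns(base_ns=1200.0, cable_m=6.0, switch_hops=1)

print(f"direct:   {direct:.0f} ns")
print(f"switched: {switched:.0f} ns")
print(f"saved:    {switched - direct:.0f} ns")
```

With these inputs the cable contributes tens of nanoseconds while the switch hop contributes hundreds, which is why removing the hop is the only term that moves the total.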
The two-node case: just plug them together
This is the cleanest possible AI-cluster setup, and the one where switchless is unambiguously correct.
Two-node direct connect (each node: 8× RTX 5090 with a ConnectX-7 OSFP NIC at 400 Gb/s): one passive OSFP DAC, no switch, ~0.8–1.2 µs RDMA latency, ~50 GB/s usable throughput per direction.
A single OSFP DAC between two ConnectX-7 NICs at 400 Gb/s. That is the entire inter-node fabric. The same NICs that would face a switch in a larger build face each other instead. RDMA verbs work, NCCL picks it up automatically, GPUDirect RDMA runs unchanged.
What you get: ~50 GB/s per direction usable, raw RDMA latency of roughly one microsecond (ib_send_lat lands around 0.8–1.2 µs), one cable. No aggregation problem because there is nothing to aggregate. No oversubscription because there is no fan-out point.
For a two-node training pair — the most common "I have outgrown one box" build in our customer base — this is the right answer. Skip the switch. Plug them together. Spend the saved money on a bigger NVMe tier or a second NIC port for redundancy.
A practical refinement: use a dual-port ConnectX-7 and run two parallel 200 Gb/s DACs between the boxes, with NCCL configured to use both HCAs (NCCL_IB_HCA=mlx5_0,mlx5_1). You lose a little bit of per-flow peak but you get path redundancy and slightly better small-message behaviour from parallel queue pairs. We default to this on two-node builds.
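One way to apply the dual-HCA setting is to build the environment in the launcher script rather than relying on shell profiles. The sketch below only constructs the environment; the device names mlx5_0 and mlx5_1 are the ones from the text and may differ on a given host, and the helper name dual_hca_env is hypothetical.

```python
import os


def dual_hca_env(hcas=("mlx5_0", "mlx5_1"), debug=False):
    """Return a copy of the current environment with NCCL pointed at both HCAs.

    The device names are assumptions: check what the RDMA stack actually
    enumerates on each host before relying on them.
    """
    env = dict(os.environ)
    env["NCCL_IB_HCA"] = ",".join(hcas)  # comma-separated list of HCAs to use
    if debug:
        env["NCCL_DEBUG"] = "INFO"  # have NCCL log which devices it selects
    return env


env = dual_hca_env(debug=True)
print(env["NCCL_IB_HCA"])

# The dictionary can then be handed to whatever starts the training job,
# for example subprocess.run(launch_command, env=env), where launch_command
# is a placeholder for your own launcher invocation.
```

Keeping the setting in code means both nodes get the same value from the same source, which avoids the case where one box quietly runs on a single port.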
Three and four nodes: triangle and K₄
Three nodes is the smallest case where the topology starts to matter. The options are:
- Linear chain (A-B-C). Two cables. Diameter 2. Node B is a hotspot — all A-to-C traffic crosses it. Avoid.
- Triangle (full mesh). Three cables. Diameter 1. Every node has two ports. Every flow is one hop. This is the right answer.
Left: triangle (K₃) — 3 nodes, 3 cables, diameter 1. Right: K₄ full mesh — 4 nodes, 6 cables, diameter 1. Every pair directly connected.
Four nodes is where it gets interesting. The full mesh — the complete graph K₄ — has six links total, three ports per node, and diameter 1. Every node reaches every other in exactly one hop. The cabling math:
| Nodes | Full-mesh links | Ports per node | Diameter |
|---|---|---|---|
| 2 | 1 | 1 | 1 |
| 3 | 3 | 2 | 1 |
| 4 | 6 | 3 | 1 |
| 5 | 10 | 4 | 1 |
| 6 | 15 | 5 | 1 |
| 7 | 21 | 6 | 1 |
| 8 | 28 | 7 | 1 |
Full mesh ports-per-node is N-1, which is why this approach implodes fast. At eight nodes you need seven ports per box, which is the end of the practical road on a single PCIe Gen5 x16 slot.
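The table follows from two formulas: a full mesh on N nodes has N(N-1)/2 links and N-1 ports per node. A short sketch that regenerates the rows (the function name full_mesh is just for illustration):

```python
def full_mesh(n):
    """Link count, ports per node, and diameter of a full mesh on n nodes."""
    if n < 2:
        raise ValueError("a mesh needs at least two nodes")
    return {
        "nodes": n,
        "links": n * (n - 1) // 2,  # one cable per unordered pair of nodes
        "ports_per_node": n - 1,    # one port for every other node
        "diameter": 1,              # every pair is directly connected
    }


print("nodes  links  ports/node  diameter")
for n in range(2, 9):
    row = full_mesh(n)
    print(f"{row['nodes']:>5}  {row['links']:>5}  "
          f"{row['ports_per_node']:>10}  {row['diameter']:>8}")
```

The link count grows quadratically while the port count grows linearly, so the cabling becomes unmanageable slightly before the NIC slots run out.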
When does four-node full mesh beat a small switch? Specifically when you have 4× K-AI 128 nodes for inference and want them tightly coupled, you are running RoCE and do not want PFC across a switch, and the marginal €15k–€25k of a 100 GbE switch with optics is meaningful in the budget.
When does a small switch win even at four nodes? When you might add a fifth node next quarter. Adding one node to a K₄ mesh means finding a free port on every existing node and running a new cable to each (four new cables for the fifth node). A switch has spare ports; you just plug in.
The 8-node case: hypercube, with an asterisk
The 3-cube (Q₃) — a hypercube of dimension 3 — is the textbook switchless layout for eight nodes. Each node sits at one corner of a cube; each edge of the cube is a direct link. Three ports per node, twelve links total, diameter 3.
| Property | Value |
|---|---|
| Nodes | 8 |
| Links | 12 |
| Ports per node | 3 |
| Diameter | 3 |
| Bisection bandwidth | 4 links |
The honest take: this is rare in production. It works, the diameter-3 worst case is acceptable for most collectives, but the cabling diagram is genuinely confusing to anyone who did not build it, troubleshooting requires understanding the Gray-code labelling, and a small 16-port 200 GbE switch is now in the same price band as the extra NIC ports and cables. The 8-node hypercube is more interesting as a teaching example than as a thing we ship. At eight nodes, our default recommendation is a switch.
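The hypercube wiring rule is easier to follow in code than in a diagram: label the nodes 0 to 7 in binary and link any two nodes whose labels differ in exactly one bit. The sketch below generates the cable list and checks the diameter by breadth-first search. Using the differing bit index as the port number on both ends is an assumed labelling convention, not something the hardware imposes.

```python
from collections import deque


def hypercube_edges(dim):
    """Edges of Q_dim: nodes are linked when their labels differ in one bit."""
    edges = []
    for node in range(1 << dim):
        for bit in range(dim):
            peer = node ^ (1 << bit)  # flip one bit to find the neighbour
            if node < peer:           # record each cable once
                edges.append((node, peer))
    return edges


def diameter(n_nodes, edges):
    """Longest shortest path in the graph, found by BFS from every node."""
    adj = {v: [] for v in range(n_nodes)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    worst = 0
    for src in range(n_nodes):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        worst = max(worst, max(dist.values()))
    return worst


q3 = hypercube_edges(3)
print(f"links: {len(q3)}, diameter: {diameter(8, q3)}")
for a, b in q3:
    port = (a ^ b).bit_length() - 1  # index of the single differing bit
    print(f"node {a:03b} port {port} <-> node {b:03b} port {port}")
```

With this convention a cable always joins the same port number on both ends, which gives a technician a quick sanity check at the rack.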
The ring: dumb, simple, and surprisingly relevant
Forget about minimizing diameter. The ring connects each node only to its two neighbours: A-B-C-D-...-A. Two ports per node regardless of cluster size. N links total. Diameter N/2.
This sounds terrible — diameter 4 on 8 nodes, diameter 16 on 32 nodes. Why is it not always wrong?
Because NCCL's ring allreduce maps onto a physical ring exactly. The algorithm passes chunks of data around the ring in two phases (reduce-scatter, then all-gather); if the physical topology already is a ring, every transfer is a one-hop send to a direct neighbour and the algorithm runs at the line rate of a single link, with no wasted bandwidth. NCCL's default for medium-to-large messages is ring, not tree, because ring is bandwidth-optimal: each node sends and receives only 2(N-1)/N times the message size, which is the minimum for allreduce, so the achieved bus bandwidth approaches the link bandwidth. The diameter of the physical topology does not matter at large message sizes. What matters is that every link is used in parallel, and the ring does that perfectly.
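The pragmatic place where a physical ring is the right switchless answer is 4–8 node training rigs where every node has exactly two RDMA ports already. The catch: ring has almost no path redundancy. One bad cable degrades the ring to a linear chain, which breaks the neighbour-to-neighbour pattern ring allreduce depends on; a second bad cable splits the cluster into two pieces.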
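To see why a ring allreduce only ever talks to its direct neighbour, here is a pure-Python simulation of the two phases. It is a teaching sketch on plain lists, not NCCL's implementation, and it counts how many elements each node sends so the total can be compared with 2(N-1)/N times the buffer size.

```python
def ring_allreduce(buffers):
    """Simulate a ring allreduce (sum) over equal-length lists, one per node.

    Returns (result_buffers, elements_sent_per_node). The buffer length
    must be divisible by the number of nodes.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer length must be divisible by node count"
    chunk = size // n
    data = [list(b) for b in buffers]
    sent = [0] * n

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def exchange(chunk_for, accumulate):
        # Snapshot every outgoing message first so all sends in a step are
        # simultaneous, then deliver each to the next node on the ring.
        msgs = []
        for i in range(n):
            c = chunk_for(i)
            msgs.append((c, [data[i][k] for k in span(c)]))
            sent[i] += chunk
        for i, (c, payload) in enumerate(msgs):
            dst = (i + 1) % n
            for k, v in zip(span(c), payload):
                data[dst][k] = data[dst][k] + v if accumulate else v

    # Phase 1, reduce-scatter: after n-1 steps node i holds the full sum
    # of chunk (i+1) mod n.
    for step in range(n - 1):
        exchange(lambda i, s=step: (i - s) % n, accumulate=True)
    # Phase 2, all-gather: circulate the finished chunks to every node.
    for step in range(n - 1):
        exchange(lambda i, s=step: (i + 1 - s) % n, accumulate=False)
    return data, sent


inputs = [[node + 1] * 8 for node in range(4)]  # node k contributes k+1 everywhere
result, sent = ring_allreduce(inputs)
print(result[0])
print(sent)
```

Every send goes from node i to node i+1, so on a physical ring each transfer uses exactly one cable and all cables are busy in every step.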
When switchless beats a small switch, in numbers
| Topology | Nodes | Links | Ports/node | Diameter | Bisection (links) |
|---|---|---|---|---|---|
| Direct connect | 2 | 1 | 1 | 1 | 1 |
| Triangle (K₃) | 3 | 3 | 2 | 1 | 2 |
| K₄ full mesh | 4 | 6 | 3 | 1 | 4 |
| 4-node ring | 4 | 4 | 2 | 2 | 2 |
| 8-node ring | 8 | 8 | 2 | 4 | 2 |
| 8-node Q₃ cube | 8 | 12 | 3 | 3 | 4 |
| 16-node Q₄ | 16 | 32 | 4 | 4 | 8 |
| 8-node star (switched) | 8 | 8 | 1 | 2 | depends on switch |
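The bisection column can be checked by brute force for graphs this small: try every way of splitting the nodes into two halves and count the links that cross the cut. The sketch below does that for the switchless rows; the graph builder names are illustrative, and the exhaustive search is only practical at these sizes.

```python
from itertools import combinations


def ring(n):
    return [(i, (i + 1) % n) for i in range(n)]


def full_mesh(n):
    return list(combinations(range(n), 2))


def hypercube(dim):
    return [(v, v ^ (1 << b))
            for v in range(1 << dim)
            for b in range(dim)
            if v < (v ^ (1 << b))]


def bisection_links(n, edges):
    """Fewest links cut by any split into halves of size n//2 and n - n//2."""
    best = len(edges)
    for half in combinations(range(n), n // 2):
        side = set(half)
        cut = sum((a in side) != (b in side) for a, b in edges)
        best = min(best, cut)
    return best


topologies = {
    "Direct connect": (2, full_mesh(2)),
    "Triangle (K3)": (3, full_mesh(3)),
    "K4 full mesh": (4, full_mesh(4)),
    "4-node ring": (4, ring(4)),
    "8-node ring": (8, ring(8)),
    "8-node Q3 cube": (8, hypercube(3)),
    "16-node Q4": (16, hypercube(4)),
}

results = {name: bisection_links(n, edges)
           for name, (n, edges) in topologies.items()}
for name, width in results.items():
    print(f"{name:<16} bisection = {width} links")
```

The ring's bisection stays at two links regardless of size, which is the number to keep in mind for all-to-all traffic even though allreduce does not care.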
Approximate price comparison for an 8-node fabric build, mid-2026 (EUR ex VAT):
| Approach | NICs needed | Cables | Switch | Total band |
|---|---|---|---|---|
| 8-node, single-switch 200 GbE star | 8× single-port 200 GbE | 8× DAC | ~€18–28k | €25–35k |
| 8-node ring, switchless | 8× dual-port 200 GbE | 8× DAC | none | €15–22k |
| 8-node Q₃ cube, switchless | 8× tri-port equivalent | 12× DAC | none | €18–26k |
| 4-node K₄ mesh, switchless | 4× tri-port equivalent | 6× DAC | none | €9–13k |
| 4-node, small 100 GbE switch | 4× single-port 100 GbE | 4× DAC | ~€8–12k | €11–16k |
| 2-node direct | 2× single-port 400 GbE | 1× DAC | none | €3–5k |
The crossover where the switch pays for itself is around 6–8 nodes, depending on bandwidth tier and whether you intend to grow.
Uplink: the part people forget
A switchless data fabric is internally self-contained. It is not, by itself, connected to anything. The cluster still needs an uplink for dataset and model pulls from corporate storage, SSH from developer workstations, telemetry to Prometheus/Grafana, IPMI/BMC management, and container registry traffic.
Pattern A — every node has a separate management NIC. Each node carries one small 25 GbE (or even 10 GbE) port to a cheap management switch, completely independent of the RDMA fabric. This is the right answer almost always. The RDMA fabric is a sterile, lossless, tuned environment; the management plane is a normal Ethernet network with normal traffic. Mix them and the management traffic disrupts your collectives.
Pattern B — dedicated uplink node. One node in the cluster has an extra port that connects out. Other nodes reach the outside world by routing through this node. Works for tight budgets and small lab setups, but the uplink node becomes a bottleneck for dataset reads and a single point of failure for management access.
The hard wall at ~16 nodes
Switchless dies above 16 nodes for three independent reasons, any one of which is sufficient:
- Port count per node. Full mesh wants N-1 ports per node. Hypercube wants log₂(N). Even the log scaling means 16 nodes need 4 ports per node, which is at the edge of practical NIC density on a single PCIe Gen5 x16 slot. 32 nodes need 5 ports per node, which means multiple slots and multiple NUMA placements to manage.
- Cabling combinatorics. A 16-node full mesh (K₁₆) has 120 cables. A 16-node Q₄ hypercube has 32. Either way, labelling, documentation, and physical access to each cable matter. One miswired cable in a 32-cable hypercube can take hours to find.
- Operational story. Replacing a failed NIC in a switchless fabric requires identifying the N-1 (or log₂ N) cables that connected it and re-routing each one to a specific port on the replacement. The MTTR difference versus switched is real.
The honest summary: switchless is right for 2 to 4 dedicated nodes, defensible for 5 to 8 nodes with a clear "we will not grow" commitment, and a mistake for 9 or more nodes. At 9+, buy a switch.
Two concrete builds worth describing
2× K-AI 256 Turin Dual, direct-connected, 400G. Two 8-GPU EPYC Turin nodes (5090 or RTX Pro 6000 Blackwell), each with a single-port ConnectX-7 400 GbE / NDR, one 3 m passive OSFP DAC between them. Total inter-node hardware cost: ~€4k. NCCL allreduce busbw on large messages: ~45 GB/s. Suitable for two-way tensor-parallel inference of a 405B dense model (split layers across the two boxes), or fine-tuning a 70B that does not quite fit on one box. We have shipped variants of this build several times. It is boring, it works, it costs an order of magnitude less than the equivalent switch-attached setup.
4× K-AI 128 in K₄ full mesh, 100G. Four single-socket EPYC nodes with 4× RTX Pro 6000 Blackwell each. Each node carries a tri-port-equivalent NIC layout (one dual-port plus one single-port, or one quad-port with one port unused), 100 GbE DAC fabric. Six cables total. Bisection bandwidth 400 Gb/s. Used for tensor-parallel inference of a 70B-class model with 4-way splitting and full activation passing between every pair. Eliminates the switch as a single point of failure for the inference service, and the customer's budget went to GPUs instead of switching gear. Trade-off: locked at four nodes; growing requires re-architecting.
When switchless wins
- 2 nodes — always switchless. No real argument for a switch.
- 3 nodes — switchless triangle. Three cables, every node one hop away. Trivial.
- 4 nodes — switchless K₄ if you will not grow, otherwise a small switch. Both are defensible; growth assumption is the deciding factor.
- 5 to 8 nodes — usually switched. Ring is plausible for bandwidth-bound work, hypercube for the truly committed. Either is harder to defend than just buying a 16-port switch.
- 9 or more nodes — switched. Always. Switchless past this point is a mistake disguised as a saving.
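The rules above are compact enough to encode directly, which is handy if the recommendation lives in a sizing script. This is a sketch of the list as written, with a hypothetical function name and a single will_grow flag standing in for the growth discussion; it is not a substitute for pricing the actual build.

```python
def recommend_fabric(nodes, will_grow=False):
    """Map a node count and growth posture to the article's recommendation."""
    if nodes < 2:
        raise ValueError("a fabric needs at least two nodes")
    if nodes == 2:
        return "switchless: direct connect"
    if nodes == 3:
        return "switchless: triangle (K3)"
    if nodes == 4:
        # Growth assumption is the deciding factor at four nodes.
        return "small switch" if will_grow else "switchless: K4 full mesh"
    if nodes <= 8:
        return "switch (ring or hypercube only with a firm no-growth commitment)"
    return "switch"


for n, grow in [(2, False), (4, False), (4, True), (6, False), (12, False)]:
    print(f"{n} nodes, will_grow={grow}: {recommend_fabric(n, grow)}")
```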
If you are sizing a small AI cluster and the switch line item is making the BOM hurt:
- Count the nodes you actually need. Not "for the next five years." This year and next. If the honest answer is 2–4, the switchless path is real and worth pricing.
- Map the NIC layout. ConnectX-7 dual-port 200 Gb/s QSFP112 is the most common direct-connect part in our 2026 builds. Quad-port SFP56 is the option for higher node counts at lower per-port speed.
- Decide on growth posture. If there is any meaningful chance of going past 8 nodes, just buy the small switch now. Recabling a mesh later is genuinely painful.
- Plan the management plane separately. Switchless data fabric, switched management plane on cheap 10 GbE. Do not collapse them onto one set of cables.
- Run nccl-tests on the as-built topology before declaring victory. The NCCL_DEBUG=INFO output tells you which physical links NCCL is actually using; cross-check against the diagram.
- Document the cabling. Photographs, port labels, a one-page diagram in the rack. The first time a NIC fails at 02:00, you will be glad.
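For the cabling documentation step, generating the label list from the topology keeps the diagram and the physical labels in sync. A minimal sketch for a full mesh follows; the hostnames, the p0/p1/p2 port names, and the C01-style cable IDs are all hypothetical conventions.

```python
from itertools import combinations


def full_mesh_cable_plan(hostnames):
    """List every cable of a full mesh once, assigning ports in order per node."""
    next_port = {h: 0 for h in hostnames}
    plan = []
    for a, b in combinations(hostnames, 2):
        port_a, port_b = next_port[a], next_port[b]
        next_port[a] += 1
        next_port[b] += 1
        plan.append({
            "cable": f"C{len(plan) + 1:02d}",  # label to print on both ends
            "a": f"{a}:p{port_a}",
            "b": f"{b}:p{port_b}",
        })
    return plan


plan = full_mesh_cable_plan(["node-a", "node-b", "node-c", "node-d"])
for row in plan:
    print(f"{row['cable']}  {row['a']}  <->  {row['b']}")
```

Printing this table and taping it inside the rack door covers most of what a technician needs when a NIC or cable has to be swapped.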
The follow-up articles to read: N04 for the switched alternative, N06 for the latency dissection that justifies the switchless win, N02 for the InfiniBand vs RoCE call that affects which NICs you buy, and K07 for the base node that all of this connects.
This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.