Storage for AI Clusters: NFS, BeeGFS, Lustre, and the Object-Store Question
Shared storage is the part of a distributed training cluster nobody thinks about until it becomes the reason GPUs sit at 40% utilization. By that point replacing it is painful — every training script has assumptions baked in, every checkpoint is in a format that may or may not migrate cleanly, and the cluster is in production. The right time to decide on storage is before the second training node is racked.
This is the storage half of the K-series. It is a buyer-and-architect view, not a sysadmin walkthrough — the goal is to make the call between NFS, BeeGFS, Lustre, and object stores with eyes open, and to be honest about which one most Kentino customers actually need.
Why shared storage is the silent bottleneck
A single-node trainer reads its dataset from local NVMe and writes checkpoints to the same place. Nothing to discuss. The moment a second node joins, two things change:
- Every node reads the same dataset. The bytes are identical on every worker and have to be visible from every node.
- Every node writes checkpoints, and the union has to survive. With FSDP or DeepSpeed sharding each rank writes its own slice; with DDP, ranks write redundantly. Either way the file system sees a burst of large writes every few minutes to few hours.
These access patterns are opposite. Dataset access is sustained, sequential, read-heavy, latency-tolerant. Checkpoint access is bursty, large-block, write-heavy, and training blocks on it. A storage system that is good at one is often mediocre at the other.
On top of that, modern ML datasets do a third thing: metadata storms. A dataset of 50 million 4–12 KB JPEGs is, from the file system's perspective, 50 million open/stat/read/close cycles. The metadata server, not the data path, falls over first. This is why "we have plenty of bandwidth" is not a sufficient answer.
| Workload | Pattern | What it needs |
|---|---|---|
| Dataset shuffle / read | Sustained sequential reads, TB-scale | Aggregate read bandwidth |
| Small-file dataset | Random metadata + tiny-file reads | Metadata IOPS, low op latency |
| Checkpoint write | Bursty large writes, all nodes at once | Burst write bandwidth, no head-of-line |
| Inference model load | One-shot large read at startup | Tolerant of anything reasonable |
NFS — the default, and it lasts longer than you think
NFS (NFSv3 or v4) is the path of least resistance. Every Linux distribution ships it, every scheduler understands it, and a "storage node" is just a Linux box with a lot of disk, an /etc/exports line, and a fast NIC.
What a sensibly-built NFS server gives you:
- One namespace mounted identically on every compute node.
- Decent sequential read throughput, especially with
nconnecton modern kernels (multiplexing one mount over several TCP connections — the single-stream ceiling has been gone for years). - Operationally boring. Failures are well understood, recovery procedures are in every sysadmin book ever written.
What you give up:
- One server, one bottleneck. Aggregate cluster throughput is capped at whatever that one box serves.
- Locking and metadata serialization. Small-file workloads hit metadata limits before bandwidth limits.
- No graceful scale-out. When the box is full you build a second with a different mount point, and your scripts learn about it.
The unfair reputation NFS has in HPC circles is mostly about the old single-connection ceiling and metadata pain on tiny files. For training datasets in the 100 GB to 1–2 TB range on 4–8 GPU nodes, a properly built NFS server is genuinely fine. We have seen customers train Qwen2.5-VL fine-tunes on NFS-backed datasets without complaint.
NFS starts hurting when:
- Datasets exceed ~2 TB and shuffles cause cache thrash on the server.
- More than ~8–12 compute nodes hit the same mount under load.
- Tens of millions of small files dominate the workload.
- Per-rank checkpoint sizes exceed 50–100 GB written concurrently.
Below all four thresholds, NFS is almost certainly the right answer and a parallel filesystem is over-engineering.
BeeGFS — the practical middle
BeeGFS is the parallel filesystem people actually deploy when NFS stops being enough. It splits data across multiple storage targets (disks on multiple servers) and metadata targets (separate NVMe servers). Clients see one namespace; reads and writes stripe across storage targets in parallel.
Why it shows up in mid-scale clusters:
- Setup is days, not weeks. A competent Linux engineer can stand up a two-storage-node, one-MDS install in an afternoon. Lustre is not in that league.
- Aggregate bandwidth scales linearly with storage targets. Three nodes on 100 GbE deliver close to 3× the read throughput of one.
- Native RDMA on InfiniBand or RoCE — avoids TCP/IP overhead.
- Open source, no per-TB licensing trap. Commercial support from ThinkParQ optional.
What it is honestly less good at:
- Metadata for very-many-small-file workloads. A single MDS becomes the bottleneck under sub-4 KB file dominance. You can add metadata targets, but the architecture is not as aggressive as Lustre's DNE.
- No built-in tiering. Manage hot/cold yourself.
- HA is bolted on, not native. Mirroring works but is not as transparent as Lustre's failover.
Sweet spot: 4–32 training nodes, 10–100 TB of dataset, files in the 1 MB to 1 GB range. Where most multi-node Kentino builds land.
A baseline BeeGFS deployment for a Kentino-class 4-node cluster:
| Component | Spec |
|---|---|
| Storage node 1 | 2U, EPYC 24-core, 256 GB RAM, 12× 7.68 TB NVMe, 2× 100 GbE |
| Storage node 2 | Identical |
| Metadata node | 1U, EPYC 16-core, 192 GB RAM, 2× 3.84 TB NVMe (RAID-1), 2× 100 GbE |
| Network | 100 GbE leaf-spine or single switch (RoCE-capable) |
| Clients | 4× K-AI training nodes, each with 1× 100 GbE |
Sustained aggregate read into the four trainers typically lands in the 30–60 GB/s range, enough to feed 32 GPUs on most vision and LLM training. This stops scaling when the metadata server saturates — add a second metadata target or start thinking about Lustre.
Lustre — gold standard, real complexity
Lustre runs the storage tier of essentially every top-500 supercomputer that needs POSIX, scaling to hundreds of petabytes and multi-TB/s aggregate throughput. Architecture has three roles: MGS/MDS (management + metadata), OSS (object storage servers holding OSTs), and clients. Clients talk to the MDS for namespace ops, then talk directly to whichever OSS holds the data — the data path bypasses the metadata server entirely. This separation is why Lustre scales, and DNE (Distributed Namespace) lets metadata scale horizontally too.
The cost is operational: steeper learning curve than BeeGFS, strict kernel/driver requirements, dozens of tuning knobs (out-of-the-box performance is usually not what the hardware can deliver), and recovery procedures that demand a competent storage engineer.
The threshold where Lustre makes sense is roughly 16–32+ training nodes or PB-class datasets. Below that, BeeGFS saturates the network long before Lustre's architectural advantages matter. Kentino's lineup mostly does not reach the threshold — we build a Lustre cluster if asked, but we tell you first that BeeGFS plus a beefy MDS gets you 80% of the way at 20% of the operational pain.
Object storage — MinIO and Ceph for ML data lakes
The parallel filesystems above are right when the framework expects files. A growing share of modern ML pipelines does not — datasets live as S3 objects, the data loader pulls them over HTTP, and the local filesystem is transient scratch.
MinIO is a single-purpose, S3-compatible object store. Distributed mode runs on 4–32+ nodes with erasure coding. Open source, operationally simpler than any parallel filesystem, and modern data loaders (PyTorch with fsspec/s3fs, WebDataset, NVIDIA DALI) read it directly.
Ceph is broader — block (RBD), file (CephFS), object (RGW) on the same RADOS layer. Ceph wins when you want one storage system for multiple workloads (VMs, file shares, ML objects). It loses on operational simplicity — Ceph is a system you commit to, with its own monitoring, tuning, and on-call discipline.
What object storage gives you for ML: cheap capacity (dense HDD nodes with erasure coding give €/TB numbers flash can't touch), operationally simple scale-out, sharded by design (no metadata bottleneck), and a good match for shard-based data loaders. What you give up: tens-of-ms per-op latency (not microseconds), no direct POSIX (rewrite the loader or accept a FUSE penalty), and no in-place writes (objects are immutable — fine for checkpoints, bad for anything that mutates).
The architecture that works in practice: object storage as the durable, large, cheap data lake; a parallel or NFS filesystem as the fast working set staged for the current run. Dataset lives in MinIO. A job stages relevant shards onto BeeGFS or local NVMe scratch at start. Final checkpoints push back to MinIO for archival.
WekaIO and VAST — the enterprise tier (briefly)
WekaIO and VAST Data are the purpose-built, enterprise-priced answer: extreme small-file metadata, very high aggregate bandwidth, native S3 and POSIX, integrated tiering, GPU-direct paths. These make excellent sense for 100+ GPU clusters where storage must keep pace with GPU spend. They are not the right call for a 4-GPU or 8-GPU server, or a cluster of four such — storage cost would dominate the build. Mentioned here for honesty about the high end, not as a recommendation for the buyer this series is written for.
Dataset access patterns and why they matter
Two patterns dominate ML training. Pattern A — large shards: datasets pre-packed into WebDataset tarballs, parquet, or LMDB, 100 MB to 10 GB per shard, read sequentially and shuffled inside. Every storage system is good at this. Pattern B — many small files: millions of JPEGs/PNGs/JSONs, one per sample, hammering the metadata path. This punishes NFS, hurts BeeGFS at scale, and is why Weka and VAST exist.
The single highest-leverage decision in training storage is moving Pattern B into Pattern A. If your dataset is small files, repack into shards before training. The cost is one preprocessing pass; the benefit is that any storage system becomes adequate. We have seen pipelines speed up 3–5× with no storage hardware change, just by repacking.
If you cannot repack — dataset constantly mutated, framework genuinely requires per-sample files — size storage around metadata IOPS, not bandwidth. That pushes you toward BeeGFS with multiple MDS targets, Weka/VAST, or Lustre with DNE.
Checkpoint strategy — write fast, write later, write less
Checkpoint writes are the worst-case pattern: every rank writes simultaneously, writes are large, training blocks until durable. A naive synchronous checkpoint on a 100 B parameter model can stall training for tens of minutes.
Current best practice is a three-stage approach, native in PyTorch's torch.distributed.checkpoint (DCP), NVIDIA NeMo, and most modern frameworks:
- Each rank writes its shard to local NVMe scratch. Total cluster write bandwidth is the sum of per-node NVMe rates (5–50 GB/s each). Training unblocks within seconds.
- An async background process copies to shared storage. Training has resumed by now.
- Older checkpoints are pruned or moved to object storage on a schedule. Keep last 2–3 on fast storage, the rest to MinIO/Ceph.
The implication: shared storage does not need to absorb every checkpoint at full speed. It needs to ingest the async copy without falling behind the cadence. For a 70 B model checkpointing every 30 minutes, low single-digit GB/s sustained write is enough — well within a competent BeeGFS or a good NFS server.
The mistake to avoid: synchronous, unsharded checkpoints written directly to NFS. That was normal in 2022. It is not anymore. If your pipeline does this, fixing the checkpoint code is higher-leverage than upgrading the storage.
The cluster uplink question
Does storage traffic share the fabric with training, or get a separate one?
Shared fabric is fine for 4–8 node clusters with async checkpointing; split fabric justified above 8 nodes or heavy synchronous I/O.
For Kentino-class 4–8 node clusters, shared fabric is fine when async checkpointing is in use and the storage workload is mostly reads — sustained read coexists with NCCL collectives without serious harm. Split fabric is justified for large LLM training with FSDP at high TP/PP degree, heavy synchronous storage traffic, or when the budget allows the headroom. The practical default: shared 100 GbE up to 8 nodes with a storage VLAN as soft isolation.
Honest recommendations by scale
| Scale | Dataset | Storage tier | Notes |
|---|---|---|---|
| 1 node | Any | Local NVMe | Skip shared storage entirely. |
| 2–3 training nodes | < 1 TB | NFS on a beefy storage node |
nconnect, generous RAM for page cache. |
| 4–8 training nodes | 1–10 TB | NFS, or first BeeGFS deployment | BeeGFS if dataset is small-file-heavy. |
| 4–16 training nodes | 10–100 TB | BeeGFS, 2–3 storage targets, dedicated MDS | Sweet spot. Add MinIO for cold archive. |
| 16–32+ training nodes | 100 TB–1 PB | BeeGFS pushed hard, or Lustre | Decision point for Lustre. |
| 32+ nodes, PB-scale, 24/7 | PB+ | Lustre, or Weka/VAST if budget allows | Past the typical Kentino build. |
| Large data lake, episodic training | PB+ cold | MinIO/Ceph, stage to fast tier per job | "Lots of data, occasional training." |
Most Kentino customers land in rows two and three. We build the rest if asked, but we will tell you when it is over-spec.
What breaks
Failure modes we have seen, ordered by how often they bite:
-
Metadata starvation under small-file load. Symptom: GPUs at 30–50%, network idle, storage CPU idle, every
open()slow. Fix: repack to shards, or scale MDS targets. - Synchronous checkpoints stalling training. Fix: async sharded checkpoints (PyTorch DCP or NeMo).
- One slow disk tanking the cluster. A failing SSD does not fail cleanly — it just gets slow. Fix: monitor per-device latency, alarm on > 2× baseline.
- Page cache thrash on the storage node. Dataset doesn't fit in RAM, every epoch evicts the previous. Fix: more RAM.
-
nconnectnot enabled on NFS clients. Default single-stream NFS caps well below NIC line rate.nconnect=8or=16gives 4–8× throughput on the same hardware. - MinIO at scale without EC tuning. Defaults prioritize durability over latency; small-object latency suffers. Don't accept defaults blindly.
None of these are exotic. All are predictable once seen once.
What to do next
If you are spec'ing storage for a new or growing training cluster, answer these before signing for hardware:
- Dataset shape — large shards or many small files? Single biggest determinant of which storage system fits.
- Active working set, not data lake size. Storage budget goes on the working set; the lake lives on cheaper, slower object storage.
- Checkpoint cadence, size, tolerance for stalls. Sizes burst write capacity and tells you whether async sharded checkpoints are mandatory (usually above ~13 B parameters, yes).
- Training nodes today and realistic 12-month projection. If you will be at 4 nodes for a year, build for 4 nodes. Do not pre-buy Lustre.
- Storage on training fabric, or split? Decide before cabling, not after.
- Cold archive plan. Models, dataset versions, completed run artifacts — these belong somewhere cheap. MinIO or Ceph is usually the answer.
Storage rewards thinking ahead more than almost any other part of an AI cluster, because the cost of getting it wrong is paid every training run for the life of the system. The honest goal is not "the fastest storage we can buy" — it is "the simplest storage that doesn't bottleneck the GPUs we already have," and for most Kentino-scale builds, that is much less exotic than the vendor marketing implies.
Adjacent articles: K01 (single-node vs multi-node), K02 (distributed training mechanics), K03 (inference clusters), N08 (RDMA + cluster uplink), W06 (storage tiers inside a single server).
This is part of the Kentino Wiki, a reference series on AI compute, robotics, and the systems that connect them. Comments and corrections welcome at info@kentino.com.