We’ve organized every stage and persona in the AI supply chain, informed by real recruiting at frontier companies. Click any row to see matching profiles from our talent graph.

Summary
Known as: ML Systems Engineer, Distributed Training Engineer, GPU/TPU Kernel Engineer, Performance Engineer, Network Engineer (AI/HPC), Fabric Engineer
Builds, maintains, and improves the systems researchers use to train models: GPU clusters, distributed training parallelism, network communication, fault tolerance, and the operational layer that keeps multi-week training runs alive.
Specializations
Where the Work Lives
Builds GPU clusters, distributed training systems, and the operational layer for multi-week runs.
Directly enables training at scale — fault tolerance, throughput, and determinism for research teams.
Candidate Archetypes
Builds fault-tolerant training services, checkpoint lifecycles, determinism controls, and cluster-level throughput levers; see the first sketch below.
Owns GPU-cluster communication paths and collective performance (InfiniBand, NVLink, NCCL) wherever comms become the binding constraint; see the second sketch below.
Runs the 24/7 operational layer: anomaly response, crash recovery, and keeping multi-week runs alive; see the third sketch below.
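To make the first archetype concrete, here is a minimal sketch of a checkpoint lifecycle with determinism controls, assuming a PyTorch stack; the model, path, and step number are illustrative placeholders rather than any real training service.

```python
# Minimal checkpoint-lifecycle sketch with determinism controls (PyTorch).
# `CKPT`, the model, and the step number are hypothetical placeholders.
import torch

CKPT = "step_001000.pt"

torch.manual_seed(1234)                    # determinism control: fixed RNG seed
torch.use_deterministic_algorithms(True)   # fail loudly on nondeterministic ops

model = torch.nn.Linear(16, 16)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Persist everything needed to resume: weights, optimizer moments, RNG stream.
torch.save(
    {
        "model": model.state_dict(),
        "opt": opt.state_dict(),
        "rng": torch.get_rng_state(),      # CPU RNG; real runs also save CUDA RNG
        "step": 1000,
    },
    CKPT,
)

# Resume path: restore state so the run continues where it stopped.
state = torch.load(CKPT)
model.load_state_dict(state["model"])
opt.load_state_dict(state["opt"])
torch.set_rng_state(state["rng"])
```

Real training services typically add asynchronous uploads, per-rank sharded state, and resume validation on top of this core pattern.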
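For the communications archetype, a sketch of the kind of collective-performance probe used to check whether comms bind throughput, assuming PyTorch's NCCL backend and a torchrun launch; the payload size, iteration counts, and file name are arbitrary.

```python
# All-reduce throughput probe sketch; launch as e.g.
#   torchrun --nproc_per_node=8 probe.py   (file name is hypothetical)
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")                   # NCCL rides NVLink/InfiniBand
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

x = torch.ones(64 * 1024 * 1024, device="cuda")   # 256 MiB of float32

for _ in range(5):                                # warm up links and algorithms
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()                          # collectives run async on GPU
elapsed = time.perf_counter() - t0

if dist.get_rank() == 0:
    gib_s = x.numel() * 4 * iters / elapsed / 2**30
    print(f"all-reduce payload rate: {gib_s:.1f} GiB/s")  # crude, not bus bandwidth
dist.destroy_process_group()
```

NVIDIA's nccl-tests report proper bus-bandwidth figures; a crude probe like this mainly tells you whether the network, rather than compute, is the binding constraint.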
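And for the operational archetype, a toy version of the keep-alive loop: relaunch on crash and resume from the latest checkpoint. The entrypoint, flag, and retry policy are hypothetical.

```python
# Crash-recovery loop sketch; assumes the training job exits nonzero on failure
# and can resume from its latest checkpoint. `train.py` and `--resume` are
# hypothetical.
import subprocess
import time

TRAIN_CMD = ["python", "train.py", "--resume", "latest"]
MAX_RESTARTS = 10

for attempt in range(1, MAX_RESTARTS + 1):
    result = subprocess.run(TRAIN_CMD)
    if result.returncode == 0:
        break                                  # run finished cleanly
    print(f"run crashed (exit {result.returncode}); restart {attempt}")
    time.sleep(30)                             # back off before relaunching
```

Real operational layers typically add node health checks, cordoning, and paging, but restart-from-checkpoint is the core contract.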
Company Scale
Only organizations that train their own models hire this persona. Early-stage companies rely on cloud providers unless the company was founded specifically to build training infrastructure.
Featured Roles
If you’re hiring at the AI frontier, let’s talk.