Private Draft

The 29 personas behind AI

We’ve organized every stage and persona in the AI supply chain, informed by real recruiting at frontier companies. Click any row to see matching profiles from our talent graph.

Shaped by Industry Experts
Kumar Chellapilla, VPE
Jennifer Anderson, VPE / Stanford PhD
Thuan Pham, CTO
Akash Garg, CTO
Linghao Zhang, Research Engineer
Wayne Chang, Early FB Engineer
Indrajit Khare, EM & Head of Product

Training Data

Curates what models learn from

Known as: Data Engineer, ML Engineer, Research Engineer (Data), Program Manager, Operations Lead

Curates and creates the data that shapes model capability. Serves both pre-training and post-training: the same function supplies web-scale corpora for base models and preference data for RLHF. This function is more central than it sounds; what data makes it into training, and how it's weighted, strongly determines what the model can do.

Specializations

Web & Text Data Sourcing: Acquires text and multimodal data that already exists through web crawling, data partnerships, licensing, and corpus assembly for language and multimodal models. Typically sits within a data engineering or research-data org.
Physical / Sensor Data Sourcing: Covers fleet collection, petabyte-scale sensor ingestion from deployed hardware, and data pipeline engineering for high-volume sensor streams. Demand comes almost entirely from Autonomous Systems and Robotics Platforms; the rest of Training Data hiring centers on language and multimodal data. Typically sits within the autonomy or robotics data org, not the general data team.
Synthetic Data Generation: Creates training data programmatically. A distinct discipline with dedicated tooling and expertise, covering model-generated examples, augmented datasets, distillation pipelines, and quality verification of synthetic outputs. Includes closed-loop data improvement, in which model outputs are fed back through quality filtering to produce new training signal; this flywheel accelerates as models get better. For physical systems, includes simulation-based data generation and domain randomization.
Data Labeling & Annotation: Creates structured training signal through human judgment. Runs the human-in-the-loop programs that power RLHF and RLVR. The operational load is heavier than the name suggests: vendor management, annotator workforce scaling, rubric and guideline design, annotator QA, throughput and reliability, and dedicated product management for annotation tools and pipelines.
Data Evaluation & Quality: Determines what data makes it into training and how it's weighted, through quality filtering, deduplication, safety filtering, and data mixing ratios. Uses evaluation methodology (metrics, benchmarks, ablation studies) to measure how data choices affect model capability, and closes data quality gaps through targeted collection and filtering. In practice, data teams run the ablations using eval infrastructure and methodology that the Evaluation & Benchmarking function designs and maintains. This is a producer/consumer relationship in which data teams are the heaviest consumers of eval tooling.
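
To make the Data Evaluation & Quality work concrete, here is a minimal Python sketch of three of the steps named above: deduplication, quality filtering, and sampling by mixing ratio. It is an illustration under stated assumptions, not a description of any company's pipeline. The function names, the five-word quality threshold, and the 70/30 web/code weights are all hypothetical, and the deduplication is exact-match only, where production systems typically use fuzzy methods.

```python
import hashlib
import random
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def dedupe_exact(docs):
    """Drop documents whose normalized text was already seen (exact dedup only)."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def passes_quality(doc, min_words=5):
    """Toy quality gate: a hypothetical minimum word count."""
    return len(doc["text"].split()) >= min_words


def sample_mixture(docs, weights, n, seed=0):
    """Draw n documents, picking each one's source according to the mixing weights."""
    by_source = defaultdict(list)
    for doc in docs:
        by_source[doc["source"]].append(doc)
    # Only sample from sources that still have documents after filtering.
    sources = [s for s in weights if by_source[s]]
    rng = random.Random(seed)
    picks = rng.choices(sources, weights=[weights[s] for s in sources], k=n)
    return [rng.choice(by_source[s]) for s in picks]


# Hypothetical toy corpus.
corpus = [
    {"source": "web", "text": "The quick brown fox jumps over the lazy dog."},
    {"source": "web", "text": "the quick  brown fox jumps over the lazy dog."},  # repeat after normalization
    {"source": "web", "text": "Click here"},  # fails the word-count gate
    {"source": "code", "text": "def add(a, b): return a + b  # adds two numbers"},
]

clean = [d for d in dedupe_exact(corpus) if passes_quality(d)]
batch = sample_mixture(clean, weights={"web": 0.7, "code": 0.3}, n=10)
```

In this framing, an ablation amounts to changing one knob (for example the mixing weights), retraining, and comparing eval results against the baseline.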
Stack layers: [1] Substrate, [2] Compute, [3] Intelligence (Primary), [4] Systems, [5] Distribution

Clive Silvia
Scale AI
Corpus sourcing

Builds pipelines and negotiates access for licensed, crawled, and partnered corpora, subject to provenance constraints.

Marva Shelia
Anthropic
Synthetic data

Generates targeted data, runs quality gates, and closes the loop by feeding verified outputs back into training.

Jim Harris
OpenAI
Human feedback ops

Runs the RLHF/RLVR supply chain — rubrics, QA, throughput, vendors, and the annotation tooling surface.

Early-Stage: Occasional
Growth: Common
Enterprise: Primary

Any org that trains or fine-tunes models. Early-stage companies outsource to vendors such as Scale or Surge; growth-stage and larger companies build in-house.

Let’s Find Your Next Builder

If you’re hiring at the AI frontier, let’s talk.