Known as: Data Engineer, ML Engineer, Research Engineer (Data), Program Manager, Operations Lead
Curates and creates the data that shapes model capability. Serves both pre-training and post-training: the same function supplies web-scale corpora for base models and preference data for RLHF. This function is more central than it sounds: which data makes it into training, and how that data is weighted, strongly determines what the model can do.
Web & Text Data Sourcing — Acquires text and multimodal data that already exists: web crawling, data partnerships, licensing, and corpus assembly for language and multimodal models. Typically sits within a data engineering or research-data org.
Physical / Sensor Data Sourcing — Fleet collection, petabyte-scale sensor ingestion from deployed hardware, and data pipeline engineering for high-volume sensor streams. Driven almost entirely by Autonomous Systems and Robotics Platforms — the rest of Training Data hiring centers on language and multimodal data. Typically sits within the autonomy or robotics data org, not the general data team.
Synthetic Data Generation — Creates training data programmatically. A distinct discipline with dedicated tooling and expertise: model-generated examples, augmented datasets, distillation pipelines, and quality verification of synthetic outputs. Includes closed-loop data improvement — feeding model outputs back through quality filtering to produce new training signal — a flywheel that accelerates as models get better. For physical systems, includes simulation-based data generation and domain randomization.
Data Labeling & Annotation — Creates structured training signal through human judgment. Runs the human-in-the-loop programs that power RLHF and RLVR. This is a heavier operational function than it sounds: vendor management, annotator workforce scaling, rubric and guideline design, annotator QA, throughput and reliability, and dedicated product management for annotation tools and pipelines.
Data Evaluation & Quality — Determines what data makes it into training and how it's weighted. Quality filtering, deduplication, safety filtering, and data mixing ratios. Uses evaluation methodology (metrics, benchmarks, ablation studies) to measure how data choices affect model capability and closes data quality gaps through targeted collection and filtering. In practice, data teams run the ablations using eval infrastructure and methodology that the Evaluation & Benchmarking function designs and maintains — a producer/consumer relationship where data teams are the heaviest consumers of eval tooling.
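As a rough illustration of the Data Evaluation & Quality steps named above (deduplication, quality filtering, and data mixing ratios), the following is a minimal Python sketch, not a description of any particular team's pipeline. Every name in it (`normalize`, `dedupe_exact`, `passes_quality`, `sample_mixture`, the `source` and `text` fields, the 0.7/0.3 weights) is hypothetical. Exact hashing of normalized text stands in for the near-duplicate detection a production system would likely use, and a word-count threshold stands in for a real quality classifier.

```python
import hashlib
import random


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def dedupe_exact(docs):
    """Keep the first document for each normalized-text hash.

    Assumption: exact matching after normalization. Real pipelines often
    add near-duplicate detection on top of this.
    """
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def passes_quality(doc, min_words=5):
    """Hypothetical quality filter: drop documents shorter than min_words."""
    return len(doc["text"].split()) >= min_words


def sample_mixture(docs, weights, n, seed=0):
    """Draw n training examples so each source appears in proportion to its weight.

    `weights` maps source name to a relative mixing weight. Sources with no
    surviving documents are skipped. Sampling is with replacement.
    """
    rng = random.Random(seed)
    by_source = {}
    for doc in docs:
        by_source.setdefault(doc["source"], []).append(doc)
    sources = [s for s in weights if s in by_source]
    picks = rng.choices(sources, weights=[weights[s] for s in sources], k=n)
    return [rng.choice(by_source[s]) for s in picks]


# Illustrative input only; these documents are made up for the example.
raw_docs = [
    {"source": "web", "text": "The quick brown fox jumps over the lazy dog."},
    {"source": "web", "text": "the  quick brown fox jumps over the lazy dog."},
    {"source": "web", "text": "Too short."},
    {"source": "code", "text": "def add(a, b): return a + b  # adds two numbers"},
]

clean_docs = [d for d in dedupe_exact(raw_docs) if passes_quality(d)]
batch = sample_mixture(clean_docs, {"web": 0.7, "code": 0.3}, n=1000)
```

In this sketch the mixing weights are the knob an ablation study would vary: rerunning training with different values in the `weights` mapping and comparing evaluation results is one way to measure how a data choice affects capability.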