Private Draft

The 29 personas behind AI

We’ve organized every stage and persona in the AI supply chain, informed by real recruiting at frontier companies. Click any row to see matching profiles from our talent graph.

Shaped by Industry Experts

← ATOMS & ENERGYUSERS & MARKETS →

← Back

Evaluation & Benchmarking

Defines what better means

Evaluation & Benchmarking

Summary

Known as: Evaluation Engineer, Research Engineer (Evals), Research Scientist (Evals)

Owns eval methodology, task design, benchmark suites, and launch gating as a cross-cutting function. Defines what "better" means and builds the suites that tell the org whether things are improving. Most personas consume eval signals; this persona designs the methodology and maintains the benchmarks.

Specializations

Capability Evaluation — Task design, grading methodology, benchmark creation and maintenance, contamination analysis. Contamination analysis covers both data leakage (benchmark answers in training sets) and structural contamination — where eval design shapes RL curriculum, inflating scores relative to real-world capability. Evals that double as training targets lose signal value; maintaining measurement integrity against optimization pressure is an ongoing design constraint. Defines the capability frontier the org is pushing against and maintains suites that give training, post-training, and product teams a hill to climb.

Reward Design & Training Feedback — Reward signal design, verifiable reward tasks, curriculum design for RL, reward hacking detection, and RL environment construction — coding sandboxes, tool-use scaffolds, verification harnesses, and increasingly diverse task environments that drive RL generalization the way broad data drove pre-training generalization. The line between "measure the model" and "design the feedback loop it trains on" has collapsed at orgs doing RLHF/RLVR — eval teams now shape training dynamics directly. Includes designing reward functions that remain robust under optimization pressure and building the verification infrastructure that makes RL training loops trustworthy.

Launch Readiness & Regression — Regression suites, product-quality thresholds, cross-release comparisons, go/no-go analysis, and the evidence base that launch decisions rest on. Includes maintaining suite integrity as models evolve — ensuring benchmarks remain informative and don't become optimization targets. Covers both capability regression (did the model get worse at X?) and safety regression (did mitigations hold?) across release candidates.

Eval and reward design are collapsing into the same function at frontier labs — as RLVR scales, the people designing capability benchmarks are increasingly the same people designing reward curricula. This is shifting hiring toward eval engineers who can build training feedback loops, not just measurement suites.

Where the Work Lives

[1]Substrate

[2]Compute

[3]Intelligence

Primary

Defines capability benchmarks, task design, and eval methodology that measure model progress.

[4]Systems

Primary

Owns launch gating, regression suites, and product-quality thresholds for safe deployment.

[5]Distribution

Candidate Archetypes

Shawn Williams

Anthropic

Task designer

Writes tasks, graders, and suites that measure real capability without becoming training targets.

Zak Elwood

OpenAI

Launch gating

Owns thresholds, suite health, and cross-release comparability that drive go/no-go decisions.

Sandra Masha

DeepMind

Measurement integrity

Defends eval signal from benchmark contamination, data leakage, and optimization pressure.

Company Scale

Early-Stage

Occasional

Growth

Common

Enterprise

Primary

Dedicated eval orgs at frontier labs (Anthropic, OpenAI, DeepMind). Growth-stage evals are typically owned by researchers or a senior engineer part-time; dedicated teams emerge at scale.

Featured Roles

Partnership Inquiries

We partner selectively with teams hiring for roles where the right person changes the trajectory.