Private Draft

The 29 personas behind AI

We’ve organized every stage and persona in the AI supply chain, informed by real recruiting at frontier companies. Click any row to see matching profiles from our talent graph.

Shaped by Industry Experts
Kumar Chellapilla
Kumar ChellapillaVPE
Jennifer Anderson
Jennifer AndersonVPE / Stanford PhD
Thuan Pham
Thuan PhamCTO
Akash Garg
Akash GargCTO
Linghao Zhang
Linghao ZhangResearch Engineer
Wayne Chang
Wayne ChangEarly FB Engineer
Indrajit Khare
Indrajit KhareEM & Head of Product
← ATOMS & ENERGYUSERS & MARKETS →
← Back

Evaluation & Benchmarking

Defines what better means
Evaluation & Benchmarking

Known as: Evaluation Engineer, Research Engineer (Evals), Research Scientist (Evals)

Owns eval methodology, task design, benchmark suites, and launch gating as a cross-cutting function. Defines what "better" means and builds the suites that tell the org whether things are improving. Most personas consume eval signals; this persona designs the methodology and maintains the benchmarks.

Specializations

Capability Evaluation Task design, grading methodology, benchmark creation and maintenance, contamination analysis. Contamination analysis covers both data leakage (benchmark answers in training sets) and structural contamination — where eval design shapes RL curriculum, inflating scores relative to real-world capability. Evals that double as training targets lose signal value; maintaining measurement integrity against optimization pressure is an ongoing design constraint. Defines the capability frontier the org is pushing against and maintains suites that give training, post-training, and product teams a hill to climb.
Reward Design & Training Feedback Reward signal design, verifiable reward tasks, curriculum design for RL, reward hacking detection, and RL environment construction — coding sandboxes, tool-use scaffolds, verification harnesses, and increasingly diverse task environments that drive RL generalization the way broad data drove pre-training generalization. The line between "measure the model" and "design the feedback loop it trains on" has collapsed at orgs doing RLHF/RLVR — eval teams now shape training dynamics directly. Includes designing reward functions that remain robust under optimization pressure and building the verification infrastructure that makes RL training loops trustworthy.
Launch Readiness & Regression Regression suites, product-quality thresholds, cross-release comparisons, go/no-go analysis, and the evidence base that launch decisions rest on. Includes maintaining suite integrity as models evolve — ensuring benchmarks remain informative and don't become optimization targets. Covers both capability regression (did the model get worse at X?) and safety regression (did mitigations hold?) across release candidates.

Eval and reward design are collapsing into the same function at frontier labs — as RLVR scales, the people designing capability benchmarks are increasingly the same people designing reward curricula. This is shifting hiring toward eval engineers who can build training feedback loops, not just measurement suites.

[1]Substrate
[2]Compute
[3]Intelligence
Primary

Defines capability benchmarks, task design, and eval methodology that measure model progress.

[4]Systems
Primary

Owns launch gating, regression suites, and product-quality thresholds for safe deployment.

[5]Distribution
Shawn Williams
Shawn Williams
Anthropic
Task designer

Writes tasks, graders, and suites that measure real capability without becoming training targets.

Zak Elwood
Zak Elwood
OpenAI
Launch gating

Owns thresholds, suite health, and cross-release comparability that drive go/no-go decisions.

Sandra Masha
Sandra Masha
DeepMind
Measurement integrity

Defends eval signal from benchmark contamination, data leakage, and optimization pressure.

Early-Stage
Occasional
Growth
Common
Enterprise
Primary

Dedicated eval orgs at frontier labs (Anthropic, OpenAI, DeepMind). Growth-stage evals are typically owned by researchers or a senior engineer part-time; dedicated teams emerge at scale.

Let’s Find Your Next Builder

If you’re hiring at the AI frontier, let’s talk.