AI Evaluator
AI evaluation specialist with a focus on stress‑testing, benchmarking, and improving large language models for real‑world use. I design structured evaluation rubrics, run systematic test suites, and perform deep error analysis to surface failure modes across reasoning, safety, and factual accuracy. My work combines hands‑on prompt experimentation, red‑team‑style probing, and quantitative metrics (e.g., accuracy, F1, calibration, and fairness measures) to give model and product teams clear, actionable feedback. I’m comfortable translating complex evaluation findings into plain language for stakeholders, and I enjoy iterating directly with engineers, researchers, and policy teams to align models with product, compliance, and user‑experience goals.