Research and Articles

Case Study · May 2026

Training a specialized hotel booking model

How general-purpose LLMs drop constraints across multi-turn booking conversations, and how a small model trained specifically on hotel workflows performs when benchmarked against GPT-4o.

Essay · May 2026

Lessons from evaluating AI in production

What a year of running evaluation infrastructure across regulated-industry deployments taught us about reliability and eval design, and why production traces matter more than benchmarks.