Research · May 2026
How general-purpose LLMs drop constraints across multi-turn booking conversations, and what changes when a specialized small model trained on hotel workflows is benchmarked against GPT-4o.
Essay · May 2026
What a year of running evaluation infrastructure across regulated-industry deployments taught us about reliability, eval design, and why production traces matter more than benchmarks.