There’s a paper making the rounds on alphaXiv called SkillsBench — the first systematic benchmark for how well “agent skills” (structured procedural knowledge packages) actually work when you plug them into LLM agents. The headline result: human-curated skills improve agent performance by 16.2 percentage points. Self-generated skills? Negligible gain. Optimal setup: two to three concise, well-crafted skills per task.

I didn’t read the paper first. I arrived at the same conclusion by building an intelligence pipeline over the past week — an AI system that scans my newsletter subscriptions across eight Gmail label categories, extracts signals tagged to my business entities, and produces a daily brief. The process was iterative: run the system, notice what it missed, adjust the skill instructions, run again. Every improvement came from me spotting a gap and writing a better directive — not from the agent optimising itself.
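That loop can be sketched in a few lines. Everything below is a hypothetical stand-in — the skill text, the `run_agent` stub, and the refinement step are illustrative, not the actual pipeline:

```python
# A "skill" here is just structured procedural text handed to the agent.
# All names and content are hypothetical illustrations of the loop:
# run -> notice a miss -> write a sharper directive -> run again.

SKILLS = {
    "triage": (
        "For each newsletter email, extract signals relevant to my "
        "business entities and tag each signal with the entity it affects."
    ),
}

def run_agent(skills: dict[str, str], inbox: list[str]) -> list[str]:
    """Stub for one agent run: returns the signals it surfaced.

    A real version would send the concatenated skills plus the inbox
    to an LLM; here the model's blind spot is simulated directly.
    """
    _prompt = "\n\n".join(skills.values())  # would go to the model
    return [msg for msg in inbox if "alphaXiv" not in msg]  # simulated gap

def refine(skills: dict[str, str], directive: str) -> dict[str, str]:
    """The human step: append a better directive to an existing skill."""
    updated = dict(skills)
    updated["triage"] += " " + directive
    return updated

inbox = ["alphaXiv: SkillsBench paper", "McKinsey weekly digest"]
first_pass = run_agent(SKILLS, inbox)  # misses the alphaXiv item
SKILLS_V2 = refine(
    SKILLS, "Treat alphaXiv paper alerts as high-priority signals."
)
```

The point of the sketch is where the arrow of improvement runs: `refine` is a human function, and nothing in the loop lets the agent call it on itself.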

The agent couldn’t figure out that alphaXiv papers were relevant to my work. It couldn’t decide that McKinsey’s volume-over-signal ratio warranted manual curation instead of automatic ingestion. It couldn’t spot that individual Medium subscription emails were duplicating the daily digest. Each of those was a human editorial judgment that made the system materially better.
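Once made, each of those judgments is trivial to encode. A minimal sketch, assuming a per-source policy table — the source names, fields, and `should_ingest` helper are hypothetical, not the real configuration:

```python
# Hypothetical per-source ingestion policies encoding the three editorial
# judgments above. The agent couldn't derive these; once a human writes
# them down, enforcing them is mechanical.

POLICIES = {
    "alphaxiv": {"mode": "auto"},    # papers are work-relevant: ingest
    "mckinsey": {"mode": "manual"},  # high volume, low signal: curate by hand
    "medium":   {"mode": "auto"},    # but drop per-article duplicates (below)
}

def should_ingest(source: str, is_digest: bool) -> bool:
    """Apply the human-written policy for one incoming email."""
    policy = POLICIES.get(source, {"mode": "auto"})
    if policy["mode"] == "manual":
        return False  # routed to a manual-curation queue instead
    if source == "medium" and not is_digest:
        return False  # individual emails duplicate the daily digest
    return True
```

The code is the easy half; deciding that McKinsey belongs in the `manual` bucket at all is the taste layer.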

This maps precisely to what SkillsBench found: the bottleneck in agentic systems isn’t model capability — it’s skill design. The models are powerful enough. What they lack is the taste layer: knowing what matters, what’s noise, and what connections to draw across domains. That layer is still, stubbornly, a human function.

The implication for anyone building agent systems: invest your time in crafting better skills, not in hoping the model will figure it out. The 16.2-point gap isn’t closing soon. And if you’re running a multi-agent orchestration stack, the quality of your skill definitions is now your primary competitive variable — not the model you’re running underneath.

We’re in the era of the artisan prompt. The irony is thick, and the returns are real.