
Every week, I get the same question in my inbox.
"Should I switch from ChatGPT to Claude?" "Is Gemini better for SEO copy?" "What model should I use for my email sequences?"
Wrong question.
A team of researchers from Stanford and MIT just published a paper that should stop the "which model" debate. They proved, with hard numbers, that the code wrapping your AI model creates a bigger performance gap than the model itself.
6x bigger.
Same model. Same task. Same data. Change only the wrapper code (how the AI stores information, what examples it retrieves, how it structures the prompt) and performance swings from terrible to 6x better.
What Stanford Actually Found
The paper is called "Meta-Harness." Researchers took a fixed AI model and asked it to classify text across legal documents, medical symptoms, and chemical patents. Then they systematically changed only the harness: the surrounding code that decides what context to feed the model, how to retrieve past examples, and what information to present.

The results:
On the text classification tasks, a 7.7-point improvement while using 4x fewer input tokens (meaning it was cheaper AND better)
On advanced math problems, an optimized harness improved accuracy by 4.7 points across five different models it had never been tested on
On TerminalBench-2 (a competitive benchmark where teams spend months hand-tuning their agents), Meta-Harness automatically discovered a configuration that beat every hand-engineered Haiku 4.5 agent and ranked #2 among all Opus 4.6 agents
Please read that again. An automated system that optimized the wrapper beat top teams of engineers who spent months hand-tuning their setups.
And it took a few hours, not months of engineering time.
The Harness Problem, for Marketers
Your AI harness is everything except the model. It's the system prompt. The examples you feed it. The memory of past interactions. The logic that decides which reference materials to surface. Your instructions. System guardrails.
When you open ChatGPT and type "write me a cold email," you're using a harness. A bad one. Zero context about your ICP. No examples of what's worked before. No brand voice reference. No memory of the 47 emails you've already written.
When you save a Claude skill file with your brand voice, top-performing examples, and step-by-step process, that's a better harness. Same model, radically different output.
The harness isn't just the prompt. It's the entire decision architecture around the prompt. The Stanford team found that access to full execution traces, the complete history of what worked and what didn't, was the single most important ingredient. Not summaries. Not scores. The raw AI receipts.
They ran an ablation study to prove it. When the system could only see scores, median accuracy was 34.6%. When it could also see summaries, 34.9%. When it got access to full execution traces? 50.0%. That's a 15.4-point jump over scores alone, roughly a 44% relative improvement, just from giving the AI access to its own history.
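If you want to check that math yourself, here's a quick sketch. The only inputs are the two accuracy figures quoted above. The function name is mine, not the paper's.

```python
def improvement(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = improved - baseline
    relative = absolute / baseline * 100
    return absolute, relative


scores_only = 34.6   # median accuracy when the system saw scores only
full_traces = 50.0   # median accuracy with full execution traces

points, percent = improvement(scores_only, full_traces)
print(f"{points:.1f} points, {percent:.1f}% relative")
# prints: 15.4 points, 44.5% relative
```

Points and percent are different things. Whenever someone quotes an AI benchmark gain, ask which one they mean.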


Why This Matters More Than Model Selection
A developer named Can Boluk ran a parallel experiment earlier this year. He changed only the edit format, one component of the harness, across 15 different LLMs. Performance improved on every single model, typically by 5 to 14 points. One outlier jumped from 6.7% to 68.3%. Same model. Different wrapper. 10x improvement.
LangChain's team got similar results. They optimized only the harness on TerminalBench-2, jumped 13.7 points without changing the model, and leapfrogged from Top 30 to Top 5 on the leaderboard.
The pattern is impossible to ignore. Models have converged. Claude, GPT, Gemini: they perform within a narrow band of each other on most marketing tasks. The model is not your competitive advantage anymore.
The system you build around it is.

Martin Fowler, the godfather of software architecture, wrote a piece on this shift. The field even has a name now: harness engineering. It's the practice of systematically improving the code around the model to maximize performance.
And almost no one in marketing is doing it.

The Practical Playbook: 4 Harness Fixes You Can Make Today
This isn't just theory. Here's how to apply harness engineering to your marketing AI workflows right now.
1. Stop Cold-Starting Every Session
The Stanford team's biggest finding: AI systems that remember their full execution history massively outperform systems that start fresh. Yet most marketers open a new chat window every time and retype their context.
Fix: Build a persistent context document. Brand voice, ICP descriptions, top-performing ad examples, common objections, brand history, your preferences. Load it first, every time. If you're using Claude Code, this is literally what CLAUDE.md files do. If you're on ChatGPT, use Custom Instructions or a pinned project file.
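Here's a minimal sketch of what "load it first, every time" can look like if you keep your context in plain files. The file names and the build_context helper are my own placeholders, not anything from the paper or from a vendor's API. The demo writes two throwaway files so the script runs on its own.

```python
import tempfile
from pathlib import Path

# Hypothetical file names; use whatever your team actually maintains.
CONTEXT_FILES = ["brand_voice.md", "icp.md", "top_ads.md", "objections.md"]


def build_context(context_dir: Path, filenames: list[str] = CONTEXT_FILES) -> str:
    """Join every context file that exists into one block of text."""
    sections = []
    for name in filenames:
        path = context_dir / name
        if path.exists():  # skip files you haven't written yet
            body = path.read_text(encoding="utf-8").strip()
            sections.append(f"## {path.stem}\n{body}")
    return "\n\n".join(sections)


# Demo with temporary files standing in for your real context folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "brand_voice.md").write_text("Plain, direct, no jargon.", encoding="utf-8")
    (demo_dir / "icp.md").write_text("Heads of marketing at B2B SaaS firms.", encoding="utf-8")
    context = build_context(demo_dir)

print(context)
```

Paste the result at the top of every session, or pass it as the system prompt if you're calling a model through code.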
2. Use the Draft-Verify Pattern
One of the harnesses Meta-Harness discovered automatically was a two-stage classification system: make a first guess, then retrieve evidence both for and against that guess before making a final decision.
Apply this to: Email subject line testing, ad copy selection, and content categorization. Don't ask the AI to give you the answer once. Ask it to give you a draft answer, then challenge its own reasoning with counterexamples. Two calls that take 5 seconds total will outperform one call that you spend 10 minutes prompt-engineering.
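As a rough sketch, the two-call version could look like this. It is my simplification for marketing tasks, not the paper's actual harness. The call_model argument is an assumption: any function that takes a prompt string and returns the model's reply. The fake_model below is a stand-in so the example runs without an API key.

```python
from typing import Callable


def draft_then_verify(task: str, call_model: Callable[[str], str]) -> dict[str, str]:
    """Call 1 drafts an answer; call 2 challenges the draft and decides."""
    draft = call_model(f"Task: {task}\nGive a first-pass answer.")
    final = call_model(
        f"Task: {task}\n"
        f"Draft answer: {draft}\n"
        "List the strongest evidence for and against this draft, "
        "including counterexamples, then give a final answer."
    )
    return {"draft": draft, "final": final}


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs offline.
    if "Draft answer:" in prompt:
        return "FINAL: Subject B (A resembles past low performers)"
    return "Subject A"


result = draft_then_verify("Pick the stronger subject line: A or B", fake_model)
print(result["draft"])
print(result["final"])
```

Swap fake_model for your real model call and the structure stays the same: one guess, one challenge.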
3. Feed Examples, Not Adjectives
The research showed that the best harnesses used contrastive pairs, examples that are similar but have different labels. "This email converted at 4.2%, this nearly identical one converted at 0.3%, here's what's different."
Stop saying: "Write in a witty but professional tone."
Start doing: Drop in 3 examples of emails that hit your target metrics and 2 that flopped. Let the model learn the pattern, not interpret your adjective.
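One way to assemble that kind of prompt, sketched under my own assumptions: the EmailExample class, the labels, and the sample emails are all illustrative, and the conversion rates are the made-up figures from the example above, not real campaign data.

```python
from dataclasses import dataclass


@dataclass
class EmailExample:
    text: str
    conversion_pct: float  # conversion rate in percent


def build_contrastive_prompt(
    winners: list[EmailExample], flops: list[EmailExample], request: str
) -> str:
    """Put winners and flops side by side so the model can compare them."""
    lines = ["Study these examples before writing.", ""]
    for ex in winners:
        lines.append(f"WORKED ({ex.conversion_pct}% conversion): {ex.text}")
    for ex in flops:
        lines.append(f"FLOPPED ({ex.conversion_pct}% conversion): {ex.text}")
    lines += ["", "Explain what separates the winners from the flops, then:", request]
    return "\n".join(lines)


# Illustrative placeholders; in practice, drop in 3 winners and 2 flops.
winners = [EmailExample("Saw your Q3 launch. Quick idea for the onboarding flow.", 4.2)]
flops = [EmailExample("Hope this finds you well. We offer best-in-class solutions.", 0.3)]

prompt = build_contrastive_prompt(winners, flops, "Write a cold email for our new analytics add-on.")
print(prompt)
```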
4. Build Your Own Execution Memory
The paper's killer feature was the filesystem: a growing archive of every attempt, every score, every trace. The AI proposer read a median of 82 files per iteration, learning from every past failure.
You probably can't build that system tomorrow. But you can start logging. Save every AI output that worked. Save every one that didn't, with a note about why. Build a "swipe file" the AI can reference. Over time, this becomes your institutional memory, and the research says it's the single biggest performance lever.
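If you'd rather log in a file than a spreadsheet, here's a bare-bones sketch. It is nowhere near the paper's archive, just a starting point. The function names, the field names, and the attempts.jsonl file are my own assumptions. Each attempt is appended as one JSON line, and the demo uses a temporary folder so it runs anywhere.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def log_attempt(log_path: Path, prompt: str, output: str, worked: bool, note: str = "") -> None:
    """Append one attempt to the log, keeping failures as well as wins."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "output": output,
        "worked": worked,
        "note": note,  # why it worked or flopped
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_attempts(log_path: Path, worked: Optional[bool] = None) -> list[dict]:
    """Read the log back; pass worked=True or False to filter."""
    if not log_path.exists():
        return []
    lines = log_path.read_text(encoding="utf-8").splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    if worked is None:
        return records
    return [r for r in records if r["worked"] is worked]


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "attempts.jsonl"
    log_attempt(log_path, "Cold email v1", "Hi there...", worked=False, note="Too generic")
    log_attempt(log_path, "Cold email v2", "Saw your launch...", worked=True, note="Specific trigger")
    everything = load_attempts(log_path)
    swipe_file = load_attempts(log_path, worked=True)

print(len(everything), "attempts logged,", len(swipe_file), "in the swipe file")
```

Feed the winners and the flops (with their notes) back into your next session and you have a small version of the execution memory described above.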

My Lesson
Here's the thing that keeps nagging me about this paper.

The system that won wasn't the one with the best model. It wasn't the one with the cleverest prompt. It was the one that learned from its own failures and adapted its own infrastructure.
The AI figured out what most marketers haven't: stop rewriting your prompts. Start building better systems around them.
The 6x gap is real. And right now, while everyone argues about GPT-5 vs. Claude Opus vs. Gemini 3, the people building better harnesses are quietly running laps around them.
Game on.
Cheers,
Alec



