Agentic Engineering Weekly for March 28 - April 4, 2026
The industry spent this week naming things it couldn't articulate before. The new terms point at structural problems: the debts AI coding creates that aren't in the code, the paradox of agents that solve complexity by generating more of it, and the moment "harness engineering" stopped being a blog post and became a discipline.
My top 3 picks this week
- Skill Issue: Andrej Karpathy on Code Agents, AutoResearch, and the Loopy Era of AI: everything is a skill issue, agentic primitives, auto-research (video)
- An AI state of the union (Simon Willison on Lenny's Podcast): whenever Willison speaks, you should listen. Such a great person to have along for this wild ride (video)
- Harness engineering for coding agent users: definition and a great feedforward/feedback decomposition that gives teams a shared vocabulary (article)
Three debts, not one: code, cognition, and intent
We've spent years talking about technical debt as the central liability of software systems. AI coding tools are remarkably good at paying it down: refactors that took weeks now take minutes. But something uncomfortable is happening in return. Teams are accumulating two new forms of debt that live outside the codebase entirely, and we're only just finding the words for them.
Cognitive debt is the erosion of shared mental models. Peter Naur argued decades ago that a program is not its source code but the understanding in the heads of its makers. When agents generate large swaths of implementation, that understanding never forms. Story after story follows the same arc: vibe-coding goes well until you hit a wall, and no amount of prompting gets you past it, because nobody on the team has an internal model of how the system actually behaves. The countermeasure isn't "read more code." It's deliberate practices that rebuild understanding: component walkthroughs, generated architecture visualizations, and occasionally the radical move of regenerating a component from scratch specifically to rebuild the team's mental model.
Intent debt is subtler: the absence of captured rationale. Why was this boundary drawn here? What trade-off was made and what was rejected? When decisions get made in a prompt and the agent executes them, the reasoning evaporates. Six months later, nobody can tell whether a design choice was deliberate or an LLM default. The fix is intent-first workflows: ADRs, spec-driven development, BDD. Artifacts that let both future humans and agents make informed changes instead of guessing.
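An ADR doesn't need to be heavy to stop that evaporation. A minimal sketch in the common Nygard-style format (the ADR number, decision, and alternatives below are invented for illustration, not from any real project):

```markdown
# ADR-012: Rate limiting lives at the API gateway, not per-service

## Status
Accepted

## Context
The agent proposed per-service token buckets; we also considered a
shared Redis limiter. Per-service limits drift out of sync as the
number of services grows.

## Decision
Enforce rate limits once, at the gateway.

## Consequences
Services stay simpler, but the gateway config becomes a change hotspot.
Rejected: per-service limits (drift), shared Redis limiter (new dependency).
```

The Context and the rejected alternatives are the parts that pay off: they let a future human, or a future agent, tell a deliberate boundary from an LLM default.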
Worth reading:
- From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI, Margaret-Anne Storey: light read defining the different kinds of debt and pointing toward solutions (paper)
- The Fighter Pilot Fallacy: Why AI Demands Skills We Don't Have: Berkeley research shows managers consume AI efficiency gains by piling on assignments, repeating the Jevons Paradox from the telegraph era (article)
- Ivett Ordog - Managing Cognitive Load in the Age of AI: A Samman technical coach argues the bottleneck isn't prompts or AGENT.md, it's cognitive load management and habits (video)
Stripe ships 1,300 agent PRs per week: dark factories aren't coming, they're here
Dan Shapiro coined the term dark factory for the endgame of agentic engineering: a software production process so automated you can turn the lights off. This week, Stripe showed us what that looks like at scale. Their internal "minions" system ships approximately 1,300 pull requests per week with minimal human coding. That's not a research demo. That's production infrastructure at one of the most demanding engineering organizations on the planet.
What makes the Stripe story worth studying isn't the raw throughput. It's the preconditions. Their years of investment in developer experience, comprehensive documentation, blessed paths for common tasks, and robust CI/CD didn't just make human engineers faster. It made agents dramatically more successful. Clear docs on "how to add a new API field" translate directly into agent instructions. The lesson: organizations that invested in developer experience for humans are now reaping compound returns from agents. Organizations that didn't are stuck teaching agents to navigate the same chaos that frustrated their human developers.
Simon Willison frames this as a practical threshold crossing: late-2025 models made agentic engineering loops reliable enough to iterate without constant babysitting. The bottleneck shifted from writing code to specification, judgment, testing, and security. The control surface for software quality is moving from manual inspection toward simulation, testing, and system-level verification. We're not there yet for most teams. But the trajectory is no longer theoretical.
Worth reading:
- Dan Shapiro's five levels coining the term "dark factory": yet another riff on the AI coding maturity ladder; as far as I can tell, this is where the "dark factory" term originates (article)
- Stripe's "Minions": How AI agents write 1,300 PRs weekly: The most concrete dark factory case study available right now, with real numbers and architecture details (video)
- An AI state of the union (Simon Willison on Lenny's Podcast): Great listen. Willison draws a hard line between vibe coding and professional agentic engineering, with sharp security warnings (video)
- Software development now costs less than a minimum wage worker: A cold, game-theoretic write-up, including the PE firm that shorted Atlassian because of Ralph (article)
Harness engineering graduates from blog post to discipline
Three independent publications landed in the same week, and the convergence is the signal. Birgitta Boeckeler published a mental model for building trust in coding agents through feedforward guides and feedback sensors. LangChain broke down the anatomy of an agent harness into composable components. And Addy Osmani documented orchestration patterns from subagents to full agent teams with quality gates. When Fowler's bliki, an AI-native framework company, and a Google DevRel lead all arrive at the same concept independently, it's time to start paying attention, folks.
The practical implications are already sharp. We're finding that LLMs can realistically follow somewhere between 100 and 500 instructions before performance degrades. In a world of AGENTS.md files, custom skills, and layered system prompts, that instruction budget compounds fast. The temptation is to keep adding: more rules, more guardrails, more context. But the winning move is ruthless curation. Build your harness, evaluate it with and without each component to measure actual impact, and expect to deprecate parts of it as models improve and your understanding deepens. Today's harness is tomorrow's legacy code.
The deeper shift is from vibes to measurement. Jessica Wang at the Coding Agents Conference put it bluntly: without real eval datasets, scoring, and experiments, you're just guessing whether your agent setup actually works. "Shipping on vibes" is how AI breaks. The teams that treat their harness as an engineering artifact, with tests, metrics, and iteration cycles, will outperform the ones who treat it as a config file they tweak by feel.
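That measurement loop doesn't need heavy infrastructure to start. A minimal ablation sketch (assumptions: `build_agent`, the component names, and the pass/fail eval below are hypothetical stand-ins for your own harness and eval set):

```python
def pass_rate(agent, tasks):
    """Fraction of eval tasks the agent solves; agent(task) -> bool."""
    return sum(agent(task) for task in tasks) / len(tasks)

def ablate(build_agent, components, tasks):
    """Score the harness with and without each component.

    build_agent(enabled: set) -> agent callable. `components` are your
    harness pieces (AGENTS.md sections, skills, guardrails). Returns
    (baseline, {component: impact}); positive impact means the
    component earns the instruction budget it consumes.
    """
    baseline = pass_rate(build_agent(set(components)), tasks)
    impact = {}
    for c in components:
        without = pass_rate(build_agent(set(components) - {c}), tasks)
        impact[c] = baseline - without
    return baseline, impact
```

Components whose measured impact hovers near zero are candidates for deprecation: they spend instructions without buying reliability.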
Worth reading:
- Harness engineering for coding agent users: The canonical framing, with feedforward/feedback decomposition that gives teams a shared vocabulary (article)
- The Anatomy of an Agent Harness: Compositional view of harness components, useful for teams building their own (article)
- Stop Shipping on Vibes - How to Build Real Evals for Coding Agents: Jessica Wang's CodeCon keynote on why eval infrastructure matters more than prompt tweaking (video)
The agentic tar pit: every greenfield becomes brownfield
Agents are excellent at obliterating Fred Brooks' accidental complexity: the boilerplate, the scaffolding, the ceremony that exists because humans are slow at typing. But they re-introduce new accidental complexity in its place: wrong abstractions, defensive boilerplate, overcomplicated solutions that no human would have chosen. Left untouched, agents choke on their own bloated codebases. Every greenfield project, given enough unsupervised agent iterations, becomes brownfield.
Wes McKinney called this the Mythical Agent-Month at the O'Reilly AI CodeCon, and the framing is sharp. Brooks told us adding people makes late projects later. McKinney extends the argument: adding agents makes messy projects messier, because agents can't distinguish accidental from essential complexity. They treat everything as a problem to solve by generating more code. When code generation is effectively free, the only defense is product taste: the discipline to say no. Every feature agents produce is free to create but not free to maintain. Each one adds surface area for bugs, confusion, and future agentic mistakes.
Karpathy's framing complements this from the operator side. When things go wrong with agents, he argues, it's an orchestration failure, a skill issue in the human operator, not a model failure: bad instructions, poor memory setup, weak task decomposition. The bottleneck has shifted from generation to verification. The more agents you run, the more you need to invest in verification infrastructure. Guardrails are not optional; they are the name of the game. But build them with the expectation that they'll be temporary: model capability improves, harnesses like Claude Code expand, building blocks get commoditized. Pro tip: if you're a leader struggling with these kinds of investment choices today, check out Wardley mapping.
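Verification infrastructure can start as something very small: a gate that runs your existing checks against agent output before any of it is accepted. A sketch, assuming the check commands are placeholders for whatever your CI already runs:

```python
import subprocess

def verify(checks):
    """Run each verification gate in order; reject at the first failure.

    `checks` is a list of (name, argv) pairs -- e.g. a test suite, a
    linter, a type checker. Returns (accepted, failing_check_or_None).
    """
    for name, argv in checks:
        if subprocess.run(argv, capture_output=True).returncode != 0:
            return False, name
    return True, None
```

The point is the shape, not the code: agent output flows through the same gates human output would, and the gates fail closed.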
Worth reading:
- Skill Issue: Andrej Karpathy on Code Agents, AutoResearch, and the Loopy Era of AI: The strongest articulation of why agent failures are operator failures, plus his auto-research vision for machine-verifiable domains (video)
- AI Unmasked Our Work as Scaffolding: A sobering look at how much of our daily work was just maintaining state, not creating value (article)
Teach agents first, coach humans alongside
Here's a reframe I've been sitting with: if you're a technical coach, your highest-leverage move right now might not be teaching people directly. It might be encoding your expertise as an agent skill first. The skill becomes a reusable asset the next time you coach someone on the same topic. You can guide a learner through the material using the skill as a script. The learner can study the concept interactively and autonomously. And your coding agents can leverage it immediately if it benefits the harness. One artifact, four returns.
This doesn't mean coaching becomes a solo act between a person and a chatbot. The human coach still matters enormously, especially for the undercurrent work that no agent can do: understanding why someone resists a new practice, helping people navigate the grief of an identity forced to change, meeting people where they actually are rather than where you think they should be. Forcing AI adoption is a losing strategy: at best you get compliance, at worst conflict. Neither leads to the genuine curiosity that adoption requires. The coaching fundamentals haven't changed: invite over inflict, lead by example, amplify what's already working.
TechWolf open-sourced their AI-first bootcamp this week, which built AI champions across HR, marketing, product, and finance. It's a useful reference for how to structure hands-on training that crosses disciplinary boundaries. Meanwhile, Kent Beck's Genie Sessions showed what it looks like to encode a specific practice (TCR: Test && Commit || Revert) as an agent skill. The pattern is emerging: coaching expertise becomes agent-consumable, and the coach's role shifts from being the delivery vehicle to curating and verifying the delivery system.
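For reference, TCR itself is a tiny loop: run the tests, commit if green, revert if red. A sketch with the test and git steps injected as callables (the names are illustrative, not Beck's actual skill; in practice `commit` and `revert` would wrap `git commit -am ...` and `git checkout -- .`):

```python
def tcr(run_tests, commit, revert):
    """Test && Commit || Revert: every change either lands or vanishes.

    run_tests() -> bool; commit() and revert() wrap your VCS commands.
    """
    if run_tests():
        commit()
        return "committed"
    revert()
    return "reverted"
```

Encoded as an agent skill, the same loop keeps an agent honest: each generation step either survives the test suite or disappears.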
Worth reading:
- Why We're Open-Sourcing Our AI-First Bootcamp (TechWolf): A practical template for cross-functional AI training, open-sourced with the exercises and data (article)
- Genie Sessions: TCR Skill (Kent Beck): Beck encoding Test && Commit || Revert as an agent skill, a concrete example of practice-as-artifact (video)
- Why artificial intelligence is no substitute for real learning: The centaur student paradox: AI supercharges experts but can stop beginners from learning anything (article)
- Cat Hicks' learning opportunities skill: A coding agent skill that reminds you when it's time to stretch that ole' brain muscle of yours after churning out 10KLOC (github)
Quick Hits
- Open Models have crossed a threshold: GLM-5 and MiniMax M2.7 match closed frontier models on agent tasks at a fraction of the cost (article)
- Programming languages for AI: Mark Seemann on which languages are best suited for LLM-based generation (article)
- Starving Genies: Kent Beck on what 3X (Explore/Expand/Extract) says about throttling AI (article)
- Every layer of review makes you 10x slower: Communication overhead math applied to code review in the age of AI throughput (article)
- You're Loading 66,000 Tokens of Plugins Before You Even Type: Token management as an underrated indicator of AI fluency (article)
- Talking Lat.md With Yury Selivanov: Armin Ronacher explores a project combining spec-driven development with a knowledge graph (video)
- Claude Dispatch and the Power of Interfaces: We often lack the interface for the job, even when the AI is capable enough (article)
- LLMs are a technological leap without a ramp: The career opt-out you might not know you're making (article)
- The Cognitive Dark Forest: The open web with AIs is turning into a dark forest (article)
Curated from 229 sources across articles, podcasts, and videos. Week of March 28 - April 4, 2026.