Personal views only (not my employer’s). This is a snapshot of my current thinking as we all learn to work with AI; the examples are from personal projects and experiments on non-critical systems.
The thing I can’t stop thinking about
Vibe coding (I don’t love the term as a catch-all for agentic coding, but it’s what people use now) used to feel like a novelty. You describe what you want, the model fills in the syntax, and you ship something before your coffee gets cold.
Now it just feels normal. My feed is full of weekend build threads and the “coding is dead” crowd is never far behind.
I spent the holidays building that family planner I mentioned before. It’s mostly just a glorified way for us to argue about whose turn it is to take the bins out, but it’s ours. I spent two weeks vibe coding with LLM agents and shipping features in minutes. It was fun, right up until I opened my laptop on January 2nd and got that first production alert from the day job.
The gap
Over the break, the cost of being wrong was basically zero. If the LLM generated something weird, I just rolled back and tried again. Explore, discard, rewrite.
Inside a bank, the friction is the point. We are building systems that have to survive auditors, 3am callouts, and days when the market goes sideways. When a system moves money, “move fast and break things” is basically just professional negligence.
The planner I vibe-coded over the break was fine, for the most part. The interesting bit was what I had to stop the tool from adding. It generated an elaborate caching layer I never asked for. I only noticed when a to-do item showed up twice in the same list: same task, same timestamp, different IDs.
The model just decided I needed caching. Because I was in “vibe mode,” I merged the PR without really looking at the implementation.
That’s fine for a chore list, but in a bank, “the model decided” won’t get you through an audit.
I ended up writing a “constitution” for the agent. These were guardrails like the WeekDoc rule, where we use one JSON document per week and replace it atomically. Without constraints, these tools breed complexity like they’re competing for a prize. I had to be the one saying “no, simpler.” The AI wanted tables and relations; I wanted something I could debug at 11pm when the data looks wrong.
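For what it’s worth, the rule itself is tiny. Here is a rough sketch of what “one JSON document per week, replaced atomically” looks like in Python; the paths and field names are made up for this post, not lifted from the planner:

```python
import json
import os
import tempfile
from pathlib import Path

DATA_DIR = Path("data")  # hypothetical location for the weekly documents


def save_week(week_id: str, doc: dict) -> None:
    """Replace the whole week's document in one atomic step - no partial
    writes, no second copy of a task hiding in a cache somewhere."""
    DATA_DIR.mkdir(exist_ok=True)
    target = DATA_DIR / f"{week_id}.json"
    # Write to a temp file in the same directory, then swap it in atomically.
    fd, tmp_path = tempfile.mkstemp(dir=DATA_DIR, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(doc, f, indent=2)
    os.replace(tmp_path, target)  # atomic on POSIX and Windows


def load_week(week_id: str) -> dict:
    path = DATA_DIR / f"{week_id}.json"
    return json.loads(path.read_text()) if path.exists() else {"tasks": []}
```

The point isn’t the code; it’s that the constraint is boring enough to hold in your head at 11pm.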
A meta-interlude: The generic trap
I actually hit this exact wall trying to write this blog post. I had these rough ideas piling up for weeks and thought, “let’s just vibe-code the article.” I fed the notes into ChatGPT, Gemini, and Claude to see what would happen.
The results were… fine? Actually, they were too fine. Perfectly polished and totally empty. It read like generic thought-leader sludge. I realized that without being hyper-specific, you just get the median output of the internet.
So I didn’t save any time. I ended up rewriting most of this from scratch to make it sound like a human. I did still use the LLMs to fix my terrible grammar and tighten the flow, so if this reads well and still feels a little polished and “thought-leaderish”, credit to them. But I had to drive the bus.
The back-and-forth was actually the useful part. Arguing with the LLMs about the narrative helped me figure out what I actually wanted to say. It echoes the coding loop exactly: you hardly ever accept the first block of code. You refine, reject, and tweak until the mental model in your head matches the thing on the screen.
A different kind of lock-in
When I wrote before about AI and vendor lock-in, I was thinking about the economics. Cheaper custom software and smaller teams doing more. That’s real, but I think I missed the risk that actually matters day-to-day.
I’m becoming less worried about platform lock-in and much more worried about intent lock-in.
I keep catching myself accepting code I don’t fully understand. The “why” behind the system lives in a chat window I can’t easily replay or share with the team. I’m not really building anymore; it feels like I’m just curating a series of fortunate accidents.
Craft vs. Outcomes
When this comes up in team discussions and online, I notice two reactions.
There’s the engineer who loves the craft. They want clean abstractions and they want to know what’s happening under the hood. For them, agentic coding feels like skipping the part where you actually understand the system. I used to think that was just being a Luddite, but now I think it’s just scar tissue. When something breaks at 2am and money is moving, you need a mental model of the system. You don’t need a stack of diffs that passed review because they looked plausible.
Then there’s the engineer motivated by outcomes. Ship, validate, move on. They have adopted these tools with a speed that is honestly a bit scary. They aren’t worried about losing the craft; they’re worried about what the job becomes when the productivity curve bends that hard. If one person can do in a day what used to take a team a week, what happens to the team?
I flip between these modes constantly. Context usually decides which instinct is rational, but the trick is knowing which context you are actually in. I’m starting to suspect that framing is a bit too clean anyway. Maybe it’s just one instinct (get the thing working) filtered through how many times you’ve been burned by a system you didn’t understand.
Normalisation of deviance
Johann Rehberger wrote about the “normalisation of deviance” in AI. It’s a concept that came out of the investigation into the Challenger disaster: if you get away with cutting a corner often enough, you stop seeing it as a corner. It just becomes how you do things.
The LLM generates something that looks right, the tests pass, and it deploys without issues. Do that enough times and your brain learns the tool is competent. You stop doing the work of being competent yourself.
I caught myself doing this last week. A 400-line PR came through from an agent. I skimmed it, saw the structure looked okay, and left a “looks good” comment. (This was not on a critical system; it was more of an experiment with AI workflows.) The next day I realized I hadn’t actually checked the error handling pattern. I trusted it because it looked like the kind of code the tool usually generates, and that code is usually fine.
Usually.
This is the “maintenance dose” problem I saw on Hacker News. If you only do 5% of the thinking, your judgment starts to atrophy. The failures aren’t loud; they’re subtle. A strange dependency choice or a slightly-off concurrency assumption. A security pattern you wouldn’t naturally reach for.
Is this an AI problem or a speed problem?
Here’s a concrete example I’ve been chewing on. Imagine an Anti-Money Laundering rule in a bank’s transaction monitoring system that flags any transaction over £10,000 for review.
The code looks fine, but the check runs against the integer portion of the amount, so £10,000.50 truncates to 10,000 and never gets flagged, while £9,999.99 stays under the threshold by design. In the world of financial crime, that’s a structuring exploit.
The requirement said “over £10,000.” Nobody captured why the boundary matters. And because the tooling makes the whole thing take an afternoon instead of two weeks, the team never hits a natural moment where someone stops and asks what the failure mode is.
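To make the boundary concrete, here is a minimal sketch of the two readings; the function names and threshold handling are illustrative, not anyone’s production code:

```python
from decimal import Decimal

THRESHOLD = Decimal("10000")  # "over £10,000", per the written requirement


def naive_flag(amount: Decimal) -> bool:
    # What the generated check effectively did: compare only the integer portion.
    return int(amount) > THRESHOLD


def literal_flag(amount: Decimal) -> bool:
    # A literal reading of the requirement: strictly greater than £10,000.
    return amount > THRESHOLD


print(naive_flag(Decimal("10000.50")))    # False - truncates to 10,000, slips through
print(literal_flag(Decimal("10000.50")))  # True
print(literal_flag(Decimal("9999.99")))   # False - a penny under the threshold,
                                          # which is exactly what structuring relies on
```

Neither version asks the question that actually matters: why £10,000, and what should happen to the transactions that cluster just below it.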
Maybe I’m overthinking this. Is it an AI problem, or is it just what happens when we move too fast? I don’t know. The lines are starting to feel very blurry.
The difference between a wish and intent
I think the real shift here is abstraction more than automation. We used to specify how systems work because we had to own the implementation details. Now, more and more, we specify what we want, and the “how” is negotiated between prompts and models.
So where does intent live?
A prompt that says “build a service to validate account balances” is really just a wish. Intent is closer to the messy stuff: validate balances, fail fast if the ledger times out after 200ms, log errors with PII scrubbed, alert the on-call if the error rate exceeds 1%.
One is asking for code; the other is describing a system. If the only place your architectural decisions exist is inside a proprietary context window, you are just renting it. And landlords can change the terms or lose the receipts anytime they want.
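One way I’ve been experimenting with pinning intent down is to write the constraints as a small, boring artefact that lives in the repo rather than in a chat window. A sketch only, and every name and number below is invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BalanceServiceIntent:
    """The 'why' behind the balance validation service, versioned with the code."""
    ledger_timeout_ms: int = 200        # fail fast if the ledger doesn't answer in time
    scrub_pii_in_logs: bool = True      # errors must never log raw customer data
    error_rate_alert_pct: float = 1.0   # page the on-call above this error rate


INTENT = BalanceServiceIntent()

# Whatever the agent generates, its configuration gets checked back against the intent.
generated_config = {"ledger_timeout_ms": 200, "scrub_pii_in_logs": True}  # stand-in for agent output
assert generated_config["ledger_timeout_ms"] == INTENT.ledger_timeout_ms
assert generated_config["scrub_pii_in_logs"] == INTENT.scrub_pii_in_logs
```

It’s crude, but the decision survives the context window.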
What I’m actually doing
I don’t buy the future where we vibe-code straight into production, at least not in regulated industries any time soon. But simply banning the tools feels like denial with extra steps.
So I’ve been trying a few things. I try to use AI as a drafter, not an author. It handles the scaffolding and the glue code, the boring stuff that doesn’t need deep domain knowledge. This works well until the agent gets stuck in a loop and I have to take over anyway. (This applies to writing too: the AI drafts, but I have to do the heavy lifting to make it real.)
I’m also trying to treat intent as a first-class artefact. We try to write out the plans and the constraints before letting the agent loose. This definitely cuts against the move-fast feeling, but I’ll sleep better at night if we have properly documented specs and requirements that the agents developed against, something to fall back on later. Mixed results so far.
The goal is to make it replayable. If an agent produced something important, I want to know how it got there. I haven’t figured out how to do this at scale without it feeling like homework, so I’m still looking for a better way. The other obvious route is traditional TDD (test-driven development), which, done properly, also puts all the intent up front. That’s hard if your teams or engineers aren’t used to doing full TDD, though; it’s a mindset change.
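Going back to the AML example, this is roughly what “intent up front” looks like in the TDD framing: the tests exist before the implementation does, and they start out red. Again, a sketch with made-up names, runnable under pytest:

```python
from decimal import Decimal

THRESHOLD = Decimal("10000")


def should_flag(amount: Decimal) -> bool:
    """Deliberately unwritten - in TDD the tests below come first."""
    raise NotImplementedError


def test_flags_amounts_just_over_the_threshold():
    assert should_flag(Decimal("10000.01")) is True


def test_pennies_are_not_truncated_away():
    # Guards against the integer-portion bug from the AML example above.
    assert should_flag(Decimal("10000.50")) is True


def test_exact_threshold_is_not_flagged():
    # "Over £10,000" means strictly over - that decision is now on the record.
    assert should_flag(Decimal("10000.00")) is False
```

The tests are the intent; the agent’s job is to turn them green without quietly moving the boundary.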
There are also various spec-driven development techniques that people and teams are trying out, and that coding agents are starting to incorporate, but I haven’t landed on the one true way yet.
I really liked the “step behind the bleeding edge” framing from Monarch Money. You explore the frontier so you understand what is coming, but you adopt it deliberately when you can wrap it in a process you can actually evidence.
Monday morning
I’m still figuring out the balance. Some days I’m convinced I’m right about this, sure about intent lock-in and the need for durable ownership. Other days I watch a demo of the latest model and wonder if I’m just overthinking it. Maybe the tools just get better and the problem solves itself?
Probably not, though.
Regulation isn’t going away. Neither are incidents, audit questions, or the slow grind of maintaining systems that handle real money and real risk. The tools are getting better faster than our instincts can keep up.
The job is making sure intent keeps pace. When we approve a PR, we shouldn’t just be approving a diff. We’re signing up to own a system later.
That’s the bit I can’t stop thinking about. At least, that’s the best frame I have for it right now.