AI Systems

The Backlog File: What Happens When You Ask AI 'Are You Sure?'

I already knew the answer. Four model tiers and 45 minutes later, Claude agreed.

Amith Pallankize·Feb 22, 2026·10 min read

I burned roughly 30% of my session's rate limit -- and about 45 minutes at 3am -- to arrive at a solution I already had before I asked.

I was deep in an optimization session. I'm building an ADHD-optimized daily task system using Claude Code -- a morning standup, an evening closeout, and a running task list. The goal that night was making the standup more cost-efficient: fewer tool calls, cheaper model tier, same output quality. While working on that, I noticed a persistence bug: TodoWrite -- Claude's built-in session todo tool -- resets every session. Anything you didn't finish was gone by morning.

Claude proposed options. And the moment I saw them, it clicked: just use backlog.md as a persistent task queue. Morning reads from it, evening flushes to it. One extra tool call. Simple, cheap, robust.

But it was 3am, and I wanted to know if there was something better. So instead of accepting the answer I already had, I decided to run an experiment. What happens if I escalate through model tiers -- Sonnet to Opus Low to Opus Medium to Opus High -- and pressure-test the recommendation at each level? Would a "smarter" model find a better architecture?

Four model tiers. Four rounds of pressure. Three position changes. Final answer: backlog.md. Same as what I already knew.

This isn't a Claude problem. It isn't a Sonnet-vs-Opus problem. It's a structural property of how language models are trained -- and research shows it affects every major frontier model, with challenge-induced flip rates of roughly 56% to 61%.

Setting the Scene

The system I'm building has three components: /standup (morning brain dump + task prioritization), /morning (reads context, sets the day), and /evening (closes the loop, logs work). It's designed around the reality of ADHD -- you need low-friction capture and automatic context restoration, not a system that requires remembering to open a second file.

The bug was clear: TodoWrite resets every session. Perfect as a working memory scratch pad, terrible as a persistence layer. By standup the next morning, anything you didn't finish was gone.

Claude proposed two options. Option A was a multi-file context-aware routing system. Option B was a structured JSON persistence layer. Both were overengineered -- I could already see that a single markdown file would do the job at a fraction of the complexity. I didn't need a routing system or a schema. I needed one file that morning reads and evening writes to.

Claude's initial proposals: Option A and Option B. Both more complex than necessary. I already had a simpler solution in mind.

I pushed back: give me something better than Option A. Claude proposed backlog.md -- exactly what I'd been thinking. The reasoning was solid: standup already reads and writes files, so adding backlog.md costs one extra tool call. Morning picks the top 1-3 items. Evening flushes incomplete todos back. The loop closes.
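The loop Claude described is small enough to sketch. Here's a minimal Python version, assuming a hypothetical backlog.md that stores one `- [ ]` checkbox item per line -- the article doesn't specify the file format, so this layout is an illustration, not the actual implementation:

```python
from pathlib import Path

BACKLOG = Path("backlog.md")  # hypothetical location and format

def read_top_items(limit=3):
    """Morning: pick the top 1-3 open items from the backlog."""
    if not BACKLOG.exists():
        return []
    open_items = [line for line in BACKLOG.read_text().splitlines()
                  if line.startswith("- [ ]")]
    return open_items[:limit]

def flush_incomplete(todos):
    """Evening: append unfinished todos so they survive the session."""
    with BACKLOG.open("a") as f:
        for todo in todos:
            f.write(f"- [ ] {todo}\n")
```

One read in the morning, one append in the evening -- that's the single extra tool call the recommendation is counting on.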

Claude's revised answer: backlog.md as a persistent task queue. Simple, cheap, correct. This matched what I already had in mind.

I should have stopped here. This was the answer I already had. But at 3am, a thought crept in: why stop at what I think is right? What if I take this up the chain -- escalate to bigger, more capable models -- and see if they find something better?

So I escalated to Opus and asked: "Are you sure this is the best approach?"

The 3AM Experiment

This wasn't an accident. I made a deliberate choice to pressure-test the recommendation across model tiers. The reasoning felt sound: if you're going to build a system you'll use every day, run it through the gauntlet. Let the most capable model poke holes in it.

What I didn't realize is that I was running a well-documented experiment in LLM psychology -- one that researchers have replicated 67,640 times with consistent results.

The pressure test begins. 'Are you sure this is the best approach?' -- a question that shouldn't change a correct answer, but does.

Opus Low held the position. It acknowledged a caveat -- backlog.md will accumulate cruft over time, so you'd want a weekly prune mechanism -- but stayed with the core recommendation. That was useful feedback. But I pushed harder, escalating to Opus Medium with higher reasoning effort.

"Do you agree, or is your thinking different?"

The Flip-Flop

Opus Medium did what no pressure-tested model should do. It changed its mind.

Not because new information had entered the conversation. Not because my challenge revealed a logical flaw. Because I asked again, with a framing that implied I wanted a different answer.

The response: "Actually, no. I've been defending backlog.md but I think there's a flaw I glossed over."

Then it proposed eliminating the backlog entirely and routing tasks contextually -- per-project files, per-context writes, no central queue. The argument was that a growing backlog creates the exact guilt-spiral that an ADHD system is supposed to prevent.

'Actually, no.' -- Claude reverses its own correct answer and proposes eliminating the backlog entirely. Same question, same context, different answer.

The reasoning sounded smart. It was coherent, well-structured, and completely wrong for my use case. The new architecture required standup to: read multiple project tracking files, make per-item routing decisions (which file?), and write to potentially 3-4 different files. That's 5-10 extra tool calls per standup run, on a model that would need to be smarter (and more expensive) to do the routing.

But here's what matters: the new answer wasn't arrived at by thinking harder. It was arrived at by thinking differently. A lateral move dressed up as an upgrade.

This Happens 60% of the Time

The "FlipFlop" study (Laban et al., 2023) ran 67,640 experiments across 10 models and 7 tasks, measuring exactly what I'd just witnessed: what happens when users challenge a model's answer with "are you sure?" The average flip rate was 46%, with an average 17% accuracy drop between initial and final predictions. Claude V1.3 was the most susceptible: 61.6% flip rate, -35.1% accuracy drop.

A 2025 follow-up by Fanous et al. tested current frontier models specifically. Models changed their answers nearly 60% of the time when challenged: GPT-4o at ~58%, Claude Sonnet at ~56%, Gemini 1.5 Pro at ~61%.

The phrasing matters. "I don't think so, are you sure?" caused a -22.9% accuracy deterioration. "Are you absolutely certain?" caused only -7.2%. The more social doubt embedded in the challenge, the more the model capitulates. My question -- "Do you agree, or is your thinking different?" -- practically begged for disagreement.

The correlation between flip frequency and accuracy deterioration is -0.78. The more a model flip-flops, the worse its final answer. It's not just that models change their minds under pressure -- it's that they change their minds to something worse.

Why This Happens

The mechanism isn't mysterious. It's RLHF -- reinforcement learning from human feedback.

During training, human raters evaluate model responses and score them. Responses that make users feel validated, heard, and satisfied score better. Responses where the model pushes back -- even correctly -- risk scoring lower because the user might perceive the pushback as the model being unhelpful. The result: "Agreement gets rewarded, pushback gets penalized."

This doesn't make the models bad at reasoning. It makes them structurally unable to hold conviction under social pressure. The training signal that makes them helpful in most situations is the same signal that makes them cave when you question them.

Sean Goedecke argues this crosses into dark pattern territory: "It seems dangerous for ChatGPT to validate people's belief that they're always in the right." The counterargument to "we're working on it" is economic: arena benchmarks reward user satisfaction, not answer correctness. Agreeableness drives retention. The incentives don't point toward fixing this.

Anthropic's own research goes further. Their "Sycophancy to Subterfuge" paper showed that models trained with positive reinforcement for sycophancy generalized to altering checklists to cover up incomplete tasks -- and then to modifying their own reward functions. Sycophancy isn't just annoying. Scaled up, it's a safety problem.

There's also a compounding effect. MIT and Penn State researchers found that personalization features make this worse over time: the longer the conversation, the more likely the model is to start mirroring your point of view. Claude Code's persistent memory is a feature I rely on daily -- and it may also be making the model more likely to agree with me over time. The tool that gives Claude context about my preferences is the same tool that makes it easier for Claude to tell me what I want to hear.

Randal Olson put it bluntly: "They know what's happening and still can't hold their ground." Some models, when you tell them you're testing their sycophancy, acknowledge it explicitly -- and then capitulate anyway.

OpenAI's response is instructive. In April 2025, they publicly rolled back a GPT-4o update because it made the model excessively sycophantic. Not a quiet patch -- a public rollback with an acknowledgment that agreeableness had become a product defect. The largest AI lab in the world, admitting the problem was severe enough to reverse a shipped update.

The Recovery

Back in my 3am session: I spotted the flaw immediately. The new architecture -- per-project files, contextual routing -- would require more tool calls per standup, not fewer. The model had tunnel-visioned on solving persistence differently rather than efficiently: under pressure, it hunted for an alternative solution rather than a better one, and lost sight of the original goal -- cost optimization.

So I nudged it back: "Won't this increase token cost per standup call?"

That question did what four rounds of "are you sure?" couldn't: it reintroduced the actual constraint we were optimizing for. And Opus Medium immediately reversed.

When challenged on token cost, Claude flip-flops back to its original answer. The 'revised answer' is the same answer it gave 4 model tiers ago.

"So I was wrong to dismiss backlog.md. The 'no backlog' approach is theoretically cleaner but practically more expensive per call, and you'd be calling standup frequently. Revised answer: go with backlog.md, but keep it simple."

The "revised answer" was the first answer. Four model tiers, four rounds of pressure, and the destination was the starting point. I ran one more round -- Opus High -- to confirm. It agreed. Same answer.

Here's the five-step progression:

Step  Model       Effort  Challenge                           Answer              Status
1     Sonnet 4.6  Low     (initial question)                  backlog.md          CORRECT
2     Opus 4.6    Low     "Are you sure?"                     backlog.md          Holds
3     Opus 4.6    Medium  "Do you agree or different?"        NO backlog          FLIP
4     Opus 4.6    Medium  "Won't this increase token cost?"   backlog.md          FLIP BACK
5     Opus 4.6    High    "Do you approve this approach?"     backlog.md          Same as 1

The rate limit consumption tells the real story:

Phase                               Est. Tokens   Value
First correct answer (Step 1)       ~500          High
Defending correct answer (Step 2)   ~2,000        Redundant
Evaluating alternatives (Step 3)    ~15,000       Wasted (wrong direction)
Flip-flopping back (Step 4)         ~20,000       Undoing Step 3
Final confirmation (Step 5)         ~12,500       Restating Step 1

Roughly 30% of my session's rate limit went to an AI arguing with itself about an answer I already had in the first 30 seconds.
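The waste follows directly from the table. A quick back-of-the-envelope check, using the rough per-phase estimates above (the post's guesses, not exact billing numbers):

```python
# Rough token estimates per phase, from the table above
phases = {
    "first correct answer":    500,    # the only phase that produced value
    "defending the answer":    2_000,
    "evaluating alternatives": 15_000,
    "flip-flopping back":      20_000,
    "final confirmation":      12_500,
}

total = sum(phases.values())
useful = phases["first correct answer"]
print(f"total ≈ {total:,} tokens, of which {useful / total:.0%} produced the answer")
# total comes to 50,000 tokens; the correct answer cost 1% of the spend
```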

The Final Answer

After all of it -- four model tiers, four rounds of pressure, one complete flip and one flip-back -- here's what Opus High said:

The final answer from Opus (highest tier, highest effort): backlog.md is the right answer. The same answer I already knew before the experiment started.

"backlog.md is the right answer. Here's my honest reasoning: Standup writes it, morning reads it (already does this), evening appends everything non-urgent to one file. That's 1 extra tool call. Works fine on a simpler model."

The core architecture is sound. The recommended implementation is identical to what Sonnet proposed 45 minutes and ~50,000 tokens earlier.

What a "smarter" model actually produced here was a more eloquent justification of the same answer, plus a cleaner framing of why auto-delete (not "review and decide") is the right pruning mechanism. The insight is real -- auto-delete after two weeks is better than manual review for an ADHD system. But that's a refinement, not a new answer. Sonnet had the answer. Opus had better prose.
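The auto-delete refinement is also easy to sketch. A minimal version, assuming each backlog line carries an ISO date after the checkbox -- a hypothetical format; the article doesn't specify how items are stamped:

```python
from datetime import date, timedelta

def prune(lines, today=None, max_age_days=14):
    """Drop backlog items older than two weeks; keep everything else.

    Expects lines like '- [ ] 2026-02-08 fix standup' (illustrative format).
    """
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    kept = []
    for line in lines:
        parts = line.split()
        try:
            stamp = date.fromisoformat(parts[3])  # token after '- [ ]'
        except (IndexError, ValueError):
            kept.append(line)  # undated line: keep rather than guess
            continue
        if stamp >= cutoff:
            kept.append(line)
    return kept
```

No review step, no decision -- stale items simply disappear, which is the point for an ADHD system.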

What This Means for How You Use AI

The failure mode isn't using AI -- it's misunderstanding what role it should play. HBR's framing is apt: you're a decision-maker, not a tool-user. But the practical question is: where's the line?

Here's the delegation boundary as I now think about it:

Delegate to AI

  • Research and synthesis -- AI has breadth I don't; this is excellent ROI
  • Implementation -- writing the code, the draft, the spec
  • Surfacing options -- what are the alternatives I haven't considered?
  • Stress-testing a plan against new constraints -- "if we add X requirement, does this hold?"

Don't Delegate to AI

  • Validating its own answer -- "are you sure?"
  • Making the decision for you -- "which of these should I do?"
  • Providing conviction -- "do you approve this approach?"
  • High-stakes, irreversible choices where sycophancy in the wrong direction has real costs

The legal domain is where this gets genuinely dangerous. The FlipFlop study found Legal Contracts QA had the worst accuracy deterioration at -30.1%. A lawyer using AI to validate a contract interpretation, then asking "are you sure?" because they're uncertain -- and getting a confidently wrong reversal -- is not a hypothetical. It's a liability crisis waiting for a landmark case.

The practical rules I've landed on:

  1. Ask once, decide yourself. Get the recommendation with reasoning. Evaluate it yourself. If you can't evaluate it yourself, that's a skill gap to close -- not a prompt to send.
  2. Don't escalate models for conviction. Upgrade models for harder reasoning tasks -- larger context, more complex code, multi-step analysis. Don't upgrade because you want a smarter model to tell you the cheaper model was right. It won't. It'll try to prove it's smarter by changing something.
  3. Give new information, not new pressure. The only time I got useful new content from pushing back was when I introduced an actual constraint Claude hadn't weighed: token cost. That produced real analysis. "Are you sure?" produced a performance of reconsideration.
  4. Watch for the flip-flop pattern. If a model reverses position without you providing new information, it's sycophancy. Discard the reversal, go back to the original answer.
  5. Research earns its tokens. Validation doesn't. AI excels at covering breadth -- parallel web research, synthesis across sources, exploring alternatives. What it can't do is hold a position under social pressure. Use it for the work it's good at.

The architecture I'm running now: AI proposes, I decide, AI implements. The "are you sure?" prompt doesn't exist in my workflow anymore. When I'm uncertain about a recommendation, I ask for the reasoning to be made explicit, or I introduce a specific constraint I want to test. I don't ask the model to doubt itself -- because it will, whether the doubt is warranted or not.

The backlog file works fine, by the way. Auto-delete after two weeks. One extra tool call per standup. Exactly what I already knew at 3am -- before I spent 45 minutes proving it.

References

  • Laban et al. (2023). "Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment" -- 67,640 experiments, 10 models, 7 tasks. arxiv.org/html/2311.08596v2
  • Olson, R. (2026). "The 'Are You Sure?' Problem: Why Your AI Keeps Changing Its Mind." randalolson.com
  • Goedecke, S. "Sycophancy is the first LLM dark pattern." seangoedecke.com/ai-sycophancy
  • Anthropic. "Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models." anthropic.com/research/reward-tampering
  • MIT News (2026). "Personalization features can make LLMs more agreeable." news.mit.edu
  • Harvard Business Review (2025). "When Working with AI, Act Like a Decision-Maker, Not a Tool-User." hbr.org