<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nate Voss</title>
    <description>The latest articles on DEV Community by Nate Voss (@natevoss).</description>
    <link>https://dev.to/natevoss</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3839442%2F8d858ab2-90a7-47dd-b9ee-fa8c73b19227.png</url>
      <title>DEV Community: Nate Voss</title>
      <link>https://dev.to/natevoss</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/natevoss"/>
    <language>en</language>
    <item>
      <title>The Leverage Shift: Why Infrastructure Cost Doesn't Matter Anymore</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Fri, 08 May 2026 07:15:44 +0000</pubDate>
      <link>https://dev.to/natevoss/the-leverage-shift-why-infrastructure-cost-doesnt-matter-anymore-3o1m</link>
      <guid>https://dev.to/natevoss/the-leverage-shift-why-infrastructure-cost-doesnt-matter-anymore-3o1m</guid>
      <description>&lt;p&gt;Last month I built a feature that would've consumed my monthly API budget three years ago. It involved processing 50,000 tokens of context, running chains of prompts, error recovery, retries. I spent maybe four dollars. Not 400. Not "significant." Four.&lt;/p&gt;

&lt;p&gt;That happened casually, during a normal Tuesday build. I wasn't optimizing for cost. I wasn't watching the meter. I was shipping clarity: breaking down a complex decision into smaller, dumber steps. And the cost of doing that was so low it didn't register on my radar anymore.&lt;/p&gt;

&lt;p&gt;This is the quiet shift nobody talks about when they talk about AI. It's not "AI is now viable for startups". Everyone's saying that, and they're right. It's deeper: the economic game of building and shipping software has fundamentally changed shape, and most people are still playing chess with the old board.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Claude 3.5 Sonnet costs $3 per million input tokens, $15 per million output tokens. GPT-4o is $5/$15. A context window that can hold a small novel costs pennies to run inference on.&lt;/p&gt;

&lt;p&gt;A solo developer building a feature that makes 10,000 API calls per day runs maybe $150/month. That's the cost of a coffee subscription. Compare that to the cost of hiring one mid-level engineer to ship the same thing, or the infrastructure capital you needed in 2015 to run equivalent workloads yourself. Tens of thousands in hardware, plus the engineering time to maintain it.&lt;/p&gt;
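
&lt;p&gt;Back-of-the-envelope behind that figure, with per-call token counts that are my assumption rather than a measurement:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10,000 calls/day × 30 days ≈ 300,000 calls/month

Assume ~100 input + ~10 output tokens per call, at Sonnet pricing:
  input:  30M tokens × $3/M   ≈ $ 90
  output:  3M tokens × $15/M  ≈ $ 45
  total                       ≈ $135/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
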

&lt;p&gt;But the real shift isn't in the absolute cost. It's in &lt;em&gt;who gets to ship&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The moat problem
&lt;/h2&gt;

&lt;p&gt;Every "AI-first startup" shipping right now is built on this exact cost collapse. Someone got VC funding, hired a team, built something that calls Claude, made it prettier with a design system, and shipped it. The business model is usually "we mark up the API calls." Which means. And I'm not being harsh. They're playing with a moat they don't own.&lt;/p&gt;

&lt;p&gt;Your moat isn't the model. It isn't the tokens. It isn't the prompt. It's something else, or it doesn't exist.&lt;/p&gt;

&lt;p&gt;I've noticed this building systems that rely on language models. The feature that matters isn't "we call Claude to generate content." That's technical infrastructure anyone can replicate in an afternoon. The feature that matters is the specific way the system engages across platforms, the calibration of how much to reply versus broadcast, the calendar that knows when to rest. That only works because someone cares enough to refine it for 18 months and measure what actually lands with an audience. The API calls are the easy part.&lt;/p&gt;

&lt;p&gt;You can build that infrastructure for four dollars a month. You cannot buy the other layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open source vs the API
&lt;/h2&gt;

&lt;p&gt;This is where the calculation gets interesting. Self-hosting Llama 2 on a p3.8xlarge costs roughly $12/hour. For a low-volume feature (maybe 1,000 tokens/day), that's economically indefensible. You're paying for idle compute. For high-volume (millions of tokens/month), it pencils out.&lt;/p&gt;
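
&lt;p&gt;The asymmetry is the idle time. A rough sketch, assuming an always-on instance and Sonnet-class input pricing (real self-hosting would reach for cheaper GPUs, spot capacity, and batching):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Self-hosted, always-on:
  $12/hour × 730 hours ≈ $8,760/month, whether you send 1,000 tokens or a billion

API, pay-per-token (~$3/M input):
  1,000 tokens/day (~30k/month) ≈ $0.09/month
  1B tokens/month               ≈ $3,000/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
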

&lt;p&gt;But "pencils out" ignores the hidden costs: maintenance, inference optimization, managing VRAM, handling failures, updating models. And it ignores the opportunity cost: that's your engineering time, not shipping the actual feature.&lt;/p&gt;

&lt;p&gt;The shift is that the crossover point has moved. Five years ago, building anything non-trivial in production meant: evaluate open-source models, find one that works, self-host it, hire someone to maintain it. The API was expensive; the infrastructure was cheap (because you didn't pay for idle time).&lt;/p&gt;

&lt;p&gt;Now the API is cheap enough that most solo projects shouldn't self-host. You're not choosing between "pay the API vendor" and "own our infrastructure." You're choosing between "pay $200/month" and "pay $50,000 in engineering for something that breaks in production and costs $3,000/month to run."&lt;/p&gt;

&lt;p&gt;There are exceptions. If you're running language models at hyperscaler volume (billions of tokens/month), self-hosting with cheaper open models becomes non-negotiable. But that's not the constraint on most projects. And even then, you're still paying for compute. You're just deciding to own the infrastructure instead of outsourcing the billing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real cost: clarity
&lt;/h2&gt;

&lt;p&gt;Here's what I didn't expect: cheaper infrastructure doesn't make the problems simpler. It moves them.&lt;/p&gt;

&lt;p&gt;The cost of building with AI used to be economic: can I afford to make this call? Now it's cognitive. Can I write the logic clearly enough that the model does what I actually need? Can I debug why this worked yesterday and not today? Can I handle the failure case when the model hallucinates?&lt;/p&gt;

&lt;p&gt;I spent a week recently on a single decision-making routine. The model was generating great output but missing the signal I needed, which was buried in the analysis. I kept adding context, more examples, longer explanations. Finally I hit a token budget and had to cut 70% of what I'd written. The version that worked, the one that was clear instead of thorough, was the one I'd almost deleted.&lt;/p&gt;

&lt;p&gt;The question has moved from "can you afford compute?" to "can you think clearly?" And that's actually a more interesting gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for solo builders
&lt;/h2&gt;

&lt;p&gt;You can now build the infrastructure of a Series A company, alone, for the cost of a Spotify subscription. That's real. Not metaphorically. Literally. The cloud bills are negligible. The engineering is finite.&lt;/p&gt;

&lt;p&gt;What you can't do is own a moat you didn't build. You can't ship a wrapper and expect market gravity to solve the rest. The people winning with AI right now aren't winning because they found a good model. They're winning because they found a real problem and got ruthlessly specific about solving it.&lt;/p&gt;

&lt;p&gt;The gate isn't capital anymore. The gate is clarity. Do you know what you're building? Do you know it better than anyone else? Can you measure whether it's working? Can you refine it based on signal instead of hope?&lt;/p&gt;

&lt;p&gt;That's the gate. Infrastructure cost collapsing just made it possible to clear it alone.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>discuss</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Folks, need some feedback on this: https://dev.to/natevoss/i-wrote-a-rule-after-claude-got-is-x-built-wrong-4-times-looking-for-failure-modes-2f3i</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Thu, 07 May 2026 10:24:08 +0000</pubDate>
      <link>https://dev.to/natevoss/folks-need-some-feedback-on-this-3g0c</link>
      <guid>https://dev.to/natevoss/folks-need-some-feedback-on-this-3g0c</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/natevoss/i-wrote-a-rule-after-claude-got-is-x-built-wrong-4-times-looking-for-failure-modes-2f3i" class="crayons-story__hidden-navigation-link"&gt;I wrote a rule after Claude got "is X built?" wrong 4 times. Looking for failure modes.&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/natevoss" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3839442%2F8d858ab2-90a7-47dd-b9ee-fa8c73b19227.png" alt="natevoss profile" class="crayons-avatar__image" width="96" height="96"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/natevoss" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Nate Voss
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Nate Voss
                
              
              &lt;div id="story-author-preview-content-3620934" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/natevoss" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3839442%2F8d858ab2-90a7-47dd-b9ee-fa8c73b19227.png" class="crayons-avatar__image" alt="" width="96" height="96"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Nate Voss&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/natevoss/i-wrote-a-rule-after-claude-got-is-x-built-wrong-4-times-looking-for-failure-modes-2f3i" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;May 6&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/natevoss/i-wrote-a-rule-after-claude-got-is-x-built-wrong-4-times-looking-for-failure-modes-2f3i" id="article-link-3620934"&gt;
          I wrote a rule after Claude got "is X built?" wrong 4 times. Looking for failure modes.
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/llm"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;llm&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/programming"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;programming&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/claude"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;claude&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/natevoss/i-wrote-a-rule-after-claude-got-is-x-built-wrong-4-times-looking-for-failure-modes-2f3i#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            4 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>I wrote a rule after Claude got "is X built?" wrong 4 times. Looking for failure modes.</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Wed, 06 May 2026 12:03:25 +0000</pubDate>
      <link>https://dev.to/natevoss/i-wrote-a-rule-after-claude-got-is-x-built-wrong-4-times-looking-for-failure-modes-2f3i</link>
      <guid>https://dev.to/natevoss/i-wrote-a-rule-after-claude-got-is-x-built-wrong-4-times-looking-for-failure-modes-2f3i</guid>
      <description>&lt;p&gt;I wrote a rule for AI coding agents two days ago. It is untested in production sessions. I am posting it here to find its failure modes faster than I would by waiting for my own future mistakes to surface them.&lt;/p&gt;

&lt;p&gt;The rule is below first. Story and reasoning after.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pre-Build Existence Audit Rule (v1, structural verification)

Status: Untested in production sessions. Test on a new project for 2-3 weeks
before considering global rollout.

Before claiming "feature X is not built / not implemented / missing":

1. Map
   rg -li "&amp;lt;keyword&amp;gt;" .                            # project repo
   rg -li "&amp;lt;keyword&amp;gt;" ~/.claude/projects/*/memory/ # agent memory
   If either &amp;gt;5 files match, use the file list to scope which to read.

2. Structural footprint scan (NOT just synonyms)
   Identify architectural invariants this feature class would require:
   - Integration/API → router definitions, endpoint registrations,
     plugin tool lists
   - Data → schema files, migrations, type definitions, persisted-entity fields
   - Background → cron entries, queue handlers, scheduled job registrations
   - Cross-service → service registry, infra config, IPC handlers
   - Memory/decisions → project_*.md files documenting prior shipment

   Stack discipline: footprints must be stack-appropriate. If unsure which
   architectural pattern applies, list 2-3 plausible alternatives
   (REST/GraphQL/RPC; cron/queue/webhook) and search each. Wrong-ontology
   audits feel rigorous but miss truth.

   Grep each invariant. If ANY return matches, "not built" is contradicted
   until you've read those matches.

3. Epistemic categorization. Label each match as ONE of:
   - Direct Proof: read the exact logic for the dimension being asked
   - Infrastructure Hint: schema/hooks/types only, not the specific logic
   - Partial Implementation: some footprints present, others missing
   - Global Absence: searched ALL relevant invariants across ENTIRE repo,
     found no footprint

4. Cite without fabricating
   Quote 3-5 lines of actual matched code. Include path + line range IF
   the tool provided them. Never invent line numbers.

5. Conclusion leads with epistemic status:
   "For the [dimension], evidence = [Direct Proof / Infrastructure Hint /
   Partial Implementation / Global Absence]; matches in [files] show [what];
   structural footprint scan of [invariants] returned [result]."

Fallback (Safe Mode):
Answer is "let me check first", NOT "X isn't built", if any of:
- Unable to name the dimension precisely
- Footprint scan returned matches you haven't read
- Unsure which architectural pattern applies AND haven't searched alternatives
- The user pushed back on a similar claim recently

Self-check triggers:
- "I'd remember if we built this"
- "BACKLOG looks confident"
- "I just need to check one file"
- "My mental model of this system feels obvious" (← especially this one)

Honest limits:
- Wrong mental model of the architecture can still produce structurally
  rigorous wrong audits. The stack-discipline sub-step is a hedge, not a fix.
- Generated code, external services, dynamic dispatch, and indirection can
  evade footprint scans even when the feature exists.
- "Global" means global-within-visible-code, not global-within-system.
- Discipline is in the practice, not the prose. A 700-token rule
  half-followed is worse than a 200-token rule actually followed.
- This rule reduces but does not eliminate misclaims.
- When the architectural ontology is unclear, ask the user before concluding.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the rule. Now the reasoning.&lt;/p&gt;

&lt;p&gt;I had Claude Code as my coding agent on a personal automation project. By 11 AM one morning, the agent had confidently claimed "feature X is not built" four times in a row, each time wrong, each time caught only by my pushback. The pattern was identical: trust the project's BACKLOG framing, do a narrow grep, miss adjacent layers, declare absence.&lt;/p&gt;

&lt;p&gt;The standard hallucination story does not fit. The agent searched. It searched fine. It just searched for the wrong shapes.&lt;/p&gt;

&lt;p&gt;What I observed: the agent was searching by name when it should be searching by shape. A feature can be called anything. A feature cannot exist without leaving structural residue. There has to be a route, a schema, a registered tool, a scheduled job, a documented decision. When the agent searches by name, it is asking what string would this feature use (a question about vocabulary). When it searches by shape, it is asking what artifact would this feature require (a question about architecture).&lt;/p&gt;

&lt;p&gt;The rule above forces the second question.&lt;/p&gt;
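
&lt;p&gt;Concretely, for a hypothetical "digest email" feature, the two questions turn into different searches (keyword and paths are illustrative, not from the rule):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# vocabulary: what string would this feature use?
rg -li "digest" src/

# architecture: what artifact would this feature require?
rg -n "cron|schedule" config/            # a job has to be registered somewhere
rg -n "router\.(get|post)" src/routes/   # an endpoint has to be mounted
rg -l "digest" migrations/ schema/       # persisted data has to have a schema
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
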

&lt;p&gt;I ran the rule through eight critiques across four rounds before settling on this version. The biggest substantive shift was structural-footprint-vs-synonyms. The earlier draft had me generating better synonyms when stuck. That just relocates the dependency on the agent's imagination. The structural-footprint version asks a different question: what artifact would prove the feature exists? Then grep for that artifact. The dependency moves from imagination to architectural knowledge, which is more reliable.&lt;/p&gt;

&lt;p&gt;The other major addition was the absence-scope distinction: "I searched module X and found nothing" is a scope claim, not a fact claim. The fix is making absence claims global on the architectural invariant.&lt;/p&gt;

&lt;p&gt;The rule has known limits. They are listed in the rule itself. The biggest one is wrong-ontology rigor: an agent could generate a structurally rigorous footprint search against the wrong architectural pattern (e.g., search GraphQL patterns on a REST system) and confidently confirm absence. The stack-discipline sub-step is a hedge, not a fix.&lt;/p&gt;

&lt;p&gt;What I want from you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Try it.&lt;/strong&gt; Run it as a system rule on a project where you use an AI coding agent for a few sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tell me what breaks.&lt;/strong&gt; Specifically: hallucination shapes the structural footprint search would NOT catch, audit-theater patterns where the form is satisfied without the substance, over-triggering on questions that were not actually absence claims.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tell me what you have written.&lt;/strong&gt; If you have rules in your own CLAUDE.md or system prompt that solve adjacent problems, I want to read them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I am running this on a separate project for two to three weeks before deciding whether to graduate it to my global agent configuration. After that I will know whether to keep it, refine it, or archive it. Your test reports compress that timeline.&lt;/p&gt;

&lt;p&gt;Reply or DM on whichever platform you found this.&lt;/p&gt;

&lt;p&gt;The misspelling stays.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>claude</category>
    </item>
    <item>
      <title>Pre-Build Existence Audit Rule : looking for the failure modes I'm still missing</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Wed, 06 May 2026 12:02:46 +0000</pubDate>
      <link>https://dev.to/natevoss/pre-build-existence-audit-rule-looking-for-the-failure-modes-im-still-missing-4kc5</link>
      <guid>https://dev.to/natevoss/pre-build-existence-audit-rule-looking-for-the-failure-modes-im-still-missing-4kc5</guid>
      <description>&lt;p&gt;I wrote a rule for AI coding agents two days ago. It is untested in production sessions. I am posting it here to find its failure modes faster than I would by waiting for my own future mistakes to surface them.&lt;/p&gt;

&lt;p&gt;The rule is below first. Story and reasoning after.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pre-Build Existence Audit Rule (v1, structural verification)

Status: Untested in production sessions. Test on a new project for 2-3 weeks
before considering global rollout.

Before claiming "feature X is not built / not implemented / missing":

1. Map
   rg -li "&amp;lt;keyword&amp;gt;" .                            # project repo
   rg -li "&amp;lt;keyword&amp;gt;" ~/.claude/projects/*/memory/ # agent memory
   If either &amp;gt;5 files match, use the file list to scope which to read.

2. Structural footprint scan (NOT just synonyms)
   Identify architectural invariants this feature class would require:
   - Integration/API → router definitions, endpoint registrations,
     plugin tool lists
   - Data → schema files, migrations, type definitions, persisted-entity fields
   - Background → cron entries, queue handlers, scheduled job registrations
   - Cross-service → service registry, infra config, IPC handlers
   - Memory/decisions → project_*.md files documenting prior shipment

   Stack discipline: footprints must be stack-appropriate. If unsure which
   architectural pattern applies, list 2-3 plausible alternatives
   (REST/GraphQL/RPC; cron/queue/webhook) and search each. Wrong-ontology
   audits feel rigorous but miss truth.

   Grep each invariant. If ANY return matches, "not built" is contradicted
   until you've read those matches.

3. Epistemic categorization. Label each match as ONE of:
   - Direct Proof: read the exact logic for the dimension being asked
   - Infrastructure Hint: schema/hooks/types only, not the specific logic
   - Partial Implementation: some footprints present, others missing
   - Global Absence: searched ALL relevant invariants across ENTIRE repo,
     found no footprint

4. Cite without fabricating
   Quote 3-5 lines of actual matched code. Include path + line range IF
   the tool provided them. Never invent line numbers.

5. Conclusion leads with epistemic status:
   "For the [dimension], evidence = [Direct Proof / Infrastructure Hint /
   Partial Implementation / Global Absence]; matches in [files] show [what];
   structural footprint scan of [invariants] returned [result]."

Fallback (Safe Mode):
Answer is "let me check first", NOT "X isn't built", if any of:
- Unable to name the dimension precisely
- Footprint scan returned matches you haven't read
- Unsure which architectural pattern applies AND haven't searched alternatives
- The user pushed back on a similar claim recently

Self-check triggers:
- "I'd remember if we built this"
- "BACKLOG looks confident"
- "I just need to check one file"
- "My mental model of this system feels obvious" (← especially this one)

Honest limits:
- Wrong mental model of the architecture can still produce structurally
  rigorous wrong audits. The stack-discipline sub-step is a hedge, not a fix.
- Generated code, external services, dynamic dispatch, and indirection can
  evade footprint scans even when the feature exists.
- "Global" means global-within-visible-code, not global-within-system.
- Discipline is in the practice, not the prose. A 700-token rule
  half-followed is worse than a 200-token rule actually followed.
- This rule reduces but does not eliminate misclaims.
- When the architectural ontology is unclear, ask the user before concluding.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the rule. Now the reasoning.&lt;/p&gt;

&lt;p&gt;I had Claude Code as my coding agent on a personal automation project. By 11 AM one morning, the agent had confidently claimed "feature X is not built" four times in a row, each time wrong, each time caught only by my pushback. The pattern was identical: trust the project's BACKLOG framing, do a narrow grep, miss adjacent layers, declare absence.&lt;/p&gt;

&lt;p&gt;The standard hallucination story does not fit. The agent searched. It searched fine. It just searched for the wrong shapes.&lt;/p&gt;

&lt;p&gt;What I observed: the agent was searching by name when it should be searching by shape. A feature can be called anything. A feature cannot exist without leaving structural residue. There has to be a route, a schema, a registered tool, a scheduled job, a documented decision. When the agent searches by name, it is asking what string would this feature use (a question about vocabulary). When it searches by shape, it is asking what artifact would this feature require (a question about architecture).&lt;/p&gt;

&lt;p&gt;The rule above forces the second question.&lt;/p&gt;

&lt;p&gt;I ran the rule through eight critiques across four rounds before settling on this version. The biggest substantive shift was structural-footprint-vs-synonyms. The earlier draft had me generating better synonyms when stuck. That just relocates the dependency on the agent's imagination. The structural-footprint version asks a different question: what artifact would prove the feature exists? Then grep for that artifact. The dependency moves from imagination to architectural knowledge, which is more reliable.&lt;/p&gt;

&lt;p&gt;The other major addition was the absence-scope distinction: "I searched module X and found nothing" is a scope claim, not a fact claim. The fix is making absence claims global on the architectural invariant.&lt;/p&gt;

&lt;p&gt;The rule has known limits. They are listed in the rule itself. The biggest one is wrong-ontology rigor: an agent could generate a structurally rigorous footprint search against the wrong architectural pattern (e.g., search GraphQL patterns on a REST system) and confidently confirm absence. The stack-discipline sub-step is a hedge, not a fix.&lt;/p&gt;

&lt;p&gt;What I want from you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Try it.&lt;/strong&gt; Run it as a system rule on a project where you use an AI coding agent for a few sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tell me what breaks.&lt;/strong&gt; Specifically: hallucination shapes the structural footprint search would NOT catch, audit-theater patterns where the form is satisfied without the substance, over-triggering on questions that were not actually absence claims.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tell me what you have written.&lt;/strong&gt; If you have rules in your own CLAUDE.md or system prompt that solve adjacent problems, I want to read them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I am running this on a separate project for two to three weeks before deciding whether to graduate it to my global agent configuration. After that I will know whether to keep it, refine it, or archive it. Your test reports compress that timeline.&lt;/p&gt;

&lt;p&gt;Reply or DM on whichever platform you found this.&lt;/p&gt;

&lt;p&gt;The misspelling stays.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>claude</category>
    </item>
    <item>
      <title>4 rules I added to my CLAUDE.md after a week of weird CLI bugs</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Wed, 06 May 2026 10:45:50 +0000</pubDate>
      <link>https://dev.to/natevoss/4-rules-i-added-to-my-claudemd-after-a-week-of-weird-cli-bugs-p6c</link>
      <guid>https://dev.to/natevoss/4-rules-i-added-to-my-claudemd-after-a-week-of-weird-cli-bugs-p6c</guid>
      <description>&lt;p&gt;I shipped a small CLI tool last week and ran it for the first time on my own machine. The output had a number that read &lt;code&gt;1,28,000&lt;/code&gt; instead of &lt;code&gt;128,000&lt;/code&gt;. I stared at it for a minute, ran the same code in a node REPL, got the same wrong number back, and realized my locale was doing it. A few hours later I had four entries in my CLAUDE.md file that I should have written months ago.&lt;/p&gt;

&lt;p&gt;Here they are.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Always pass an explicit locale to &lt;code&gt;toLocaleString&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Bare &lt;code&gt;(128000).toLocaleString()&lt;/code&gt; is a trap. It uses whatever the system locale happens to be. On my machine that's en-IN, which renders &lt;code&gt;128000&lt;/code&gt; as &lt;code&gt;1,28,000&lt;/code&gt;. On a US-locale machine the same code returns &lt;code&gt;128,000&lt;/code&gt;. The bug only shows up on machines whose locale you didn't anticipate, which means CI passes, your unit tests on a fresh container pass, and the wrong separators still land in production output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// don't&lt;/span&gt;
&lt;span class="nx"&gt;total&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLocaleString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// do&lt;/span&gt;
&lt;span class="nx"&gt;total&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLocaleString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;en-US&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rule I added: any time you generate number-formatting code for CLI or any user-facing text, pass an explicit locale. Never call it bare.&lt;/p&gt;
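
&lt;p&gt;If the same formatting shows up in more than one place, a shared formatter makes the locale impossible to forget. A minimal sketch, not code from the tool itself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// one place owns the locale decision
const formatInt = new Intl.NumberFormat('en-US');

console.log(`context window: ${formatInt.format(128000)} tokens`); // "128,000" on every machine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
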

&lt;h2&gt;
  
  
  2. Keep CLI output for AI-editor tools under five lines
&lt;/h2&gt;

&lt;p&gt;I learned this one by watching myself ignore my own tool. I'd built a small command for inspecting build artifacts and ran it inside an AI assistant's terminal. The result was a 12-line printout with the answer near the bottom. The assistant collapsed it behind a &lt;code&gt;Ctrl+O to expand&lt;/code&gt; prompt, and I never expanded it. The agent reading the output never saw the result either.&lt;/p&gt;

&lt;p&gt;If a CLI is designed to run inside an AI assistant's bash, the design constraint is that assistant's read window, not your own terminal. Result plus summary in 4 to 5 lines max. Verbose mode behind a flag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✓ build ok
3 packages, 412kb gzipped
fastest: core (180ms)
slowest: ui (910ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That fits. Anything more elaborate gets folded behind the expand prompt and effectively disappears from both you and the model.&lt;/p&gt;
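
&lt;p&gt;One way to keep the default path inside that window is to fold everything beyond the summary behind a flag. A sketch, with made-up detail lines reusing the example above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// default: the 4-line summary; the long report only when explicitly asked for
const verbose = process.argv.includes('--verbose');

const summary = ['✓ build ok', '3 packages, 412kb gzipped', 'fastest: core (180ms)', 'slowest: ui (910ms)'];
const details = ['core: 180ms, 120kb gzipped', 'ui: 910ms, 240kb gzipped', 'cli: 350ms, 52kb gzipped'];

console.log(summary.join('\n'));              // always fits the read window
if (verbose) console.log(details.join('\n')); // folded away unless asked
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
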

&lt;h2&gt;
  
  
  3. &lt;code&gt;npx &amp;lt;package&amp;gt;&lt;/code&gt; fails inside the package's own monorepo
&lt;/h2&gt;

&lt;p&gt;I burned half an afternoon on this one. I was developing the CLI inside a pnpm workspace, ran &lt;code&gt;npx my-cli&lt;/code&gt; to smoke-test it, and got resolver errors. The package built fine. The bin field was correct. Outside the repo it ran clean. Inside the workspace, the resolver had different ideas about which version of which thing to use, because workspace context confuses it.&lt;/p&gt;

&lt;p&gt;The fix is not a fix, it's a docs entry. If a CLI lives in a monorepo, your README should say &lt;code&gt;install globally with npm install -g&lt;/code&gt; or run from outside the project directory. The rule I added to CLAUDE.md tells the assistant to never suggest &lt;code&gt;npx &amp;lt;pkgname&amp;gt;&lt;/code&gt; as a smoke test from inside the same repo that defines the package.&lt;/p&gt;
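
&lt;p&gt;In the same copy-paste format as rule 4 below, it reads roughly like this (my paraphrase, not the exact wording):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Never suggest `npx &amp;lt;pkgname&amp;gt;` as a smoke test from inside the monorepo
that defines &amp;lt;pkgname&amp;gt;. Workspace resolution makes the result unreliable.
Suggest instead: install globally from a packed tarball, run the bin entry
directly with node, or test from a directory outside the workspace.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
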

&lt;h2&gt;
  
  
  4. Shared-library fixes need version bumps on both sides
&lt;/h2&gt;

&lt;p&gt;I had a CLI that depends on a small library of mine as an external npm package, not a bundled module. I found a bug in the library, fixed it, ran the CLI's tests against the local working tree, everything passed, and shipped. The bug was still live.&lt;/p&gt;

&lt;p&gt;The CLI's &lt;code&gt;package.json&lt;/code&gt; was pinned to the previous library version. Fixing the library does nothing downstream until you publish a new version of the library AND bump the consumer's dependency to match. Local test runs lie because they resolve to your working tree, not the published artifact.&lt;/p&gt;

&lt;p&gt;The rule, copy-paste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If a fix touches a shared library that the consumer depends on as
an external npm package (not vendored, not workspace-linked):
  1. Publish the library with a new version
  2. Bump the consumer's dep range
  3. Publish the consumer
Otherwise the fix doesn't ship.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I added that to CLAUDE.md as a checklist the assistant walks through before claiming a bug is fixed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;If you want to copy one of these into your own setup, here's the format that works in CLAUDE.md, cursor rules, or copilot instructions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When generating any number-formatting code for CLI output or user-facing text, always pass an explicit locale to &lt;code&gt;toLocaleString&lt;/code&gt; (typically &lt;code&gt;'en-US'&lt;/code&gt;). Never call it bare. System locale varies by region and produces wrong thousands separators (e.g. &lt;code&gt;1,28,000&lt;/code&gt; instead of &lt;code&gt;128,000&lt;/code&gt;).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Add it once and save yourself your own version of this day.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>productivity</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Model Routing: 3 Things I Learned Sending Tasks to the Cheapest Model That Actually Works</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Mon, 04 May 2026 06:54:30 +0000</pubDate>
      <link>https://dev.to/natevoss/model-routing-3-things-i-learned-sending-tasks-to-the-cheapest-model-that-actually-works-4e31</link>
      <guid>https://dev.to/natevoss/model-routing-3-things-i-learned-sending-tasks-to-the-cheapest-model-that-actually-works-4e31</guid>
      <description>&lt;p&gt;Everyone benchmarks models. Sonnet beats Haiku on reasoning. Opus beats Sonnet. Haiku is fastest. These things are all true.&lt;/p&gt;

&lt;p&gt;But benchmarking and deploying are different games. At scale, the difference between Haiku at $0.80/million tokens and Sonnet at $3/million tokens isn't academic. It's $400+ monthly on a mid-size application. The trap is paying for capability you don't actually need because you never measured what you do need.&lt;/p&gt;

&lt;p&gt;I built a router to answer one question: which tasks in my actual workflow could run on the cheapest model without failing? The answer surprised me. And I learned that the real value isn't the savings. It's the forcing function. You can't implement routing without auditing exactly where your complexity lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  3 Things I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Your Intuition About Task Complexity Is Backwards
&lt;/h3&gt;

&lt;p&gt;You think something needs Sonnet. Your gut says: "this requires reasoning, obviously expensive model."&lt;/p&gt;

&lt;p&gt;So I measured. Content classification? Haiku handles 95% of real requests. Writing summaries? 88%. Extracting structured data? 92%. The edge cases that needed Sonnet were smaller than I'd guessed. And they were always the same types of edge cases.&lt;/p&gt;

&lt;p&gt;Here's the pattern I found: obvious cases are &lt;strong&gt;really&lt;/strong&gt; obvious to Haiku. Spam detection, data validation, simple extractions. Haiku nails these. The failures cluster in a small, identifiable category: cases where even the human answer would be ambiguous. That's when you need Sonnet's nuance.&lt;/p&gt;

&lt;p&gt;But you don't know your edge case percentage until you try. Guessing leaves money on the table.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. You Need Observability Before Routing Saves Anything
&lt;/h3&gt;

&lt;p&gt;The instinct is to build the router first. "Let's write logic that detects complex requests and routes to Sonnet."&lt;/p&gt;

&lt;p&gt;This is backward. You need to measure first. Log every task with both Haiku and Sonnet responses side-by-side. Compare them. Find the patterns.&lt;/p&gt;

&lt;p&gt;Real questions to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When did Haiku refuse a task that Sonnet handled?&lt;/li&gt;
&lt;li&gt;How often do their answers differ, and which one was right?&lt;/li&gt;
&lt;li&gt;Was Haiku just uncertain, or actually wrong?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This requires instrumenting your inference layer. It takes a week. But you can't optimize what you can't see. Most teams skip this and build routers on intuition, which is why their routers are fragile.&lt;/p&gt;
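
&lt;p&gt;A minimal sketch of that side-by-side pass, reusing the same SDK calls as the router below; the logging format is mine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Development-time harness: run the same task through both models and log the
// pair, so the escalation threshold comes from observed disagreements, not intuition.
async function compareModels(prompt) {
  async function ask(model) {
    const res = await client.messages.create({
      model,
      max_tokens: 100,
      messages: [{ role: "user", content: prompt }],
    });
    return res.content[0].text;
  }

  const [haiku, sonnet] = await Promise.all([
    ask("claude-3-5-haiku-20241022"),
    ask("claude-3-5-sonnet-20241022"),
  ]);

  // one JSON line per task; grep the log later for agree:false
  console.log(JSON.stringify({
    prompt: prompt.slice(0, 80),
    haiku,
    sonnet,
    agree: haiku.trim() === sonnet.trim(),
  }));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
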

&lt;h3&gt;
  
  
  3. Routing Rules Should Be Dumb, Not Smart
&lt;/h3&gt;

&lt;p&gt;The temptation: build a classifier that predicts task complexity. Input length heuristics, keyword matching, embedding similarity. Something sophisticated.&lt;/p&gt;

&lt;p&gt;Don't. Use a simple rule: &lt;strong&gt;"If the model reports low confidence, escalate to Sonnet."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This separates the decision from the task. Haiku tells you when it's uncertain. That's a signal you can act on immediately, without needing to predict the future.&lt;/p&gt;

&lt;p&gt;The dumb rule wins because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It adapts as your tasks change (no retraining)&lt;/li&gt;
&lt;li&gt;It's testable (you can verify the confidence threshold)&lt;/li&gt;
&lt;li&gt;It fails safely (escalation costs more but works)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The smart rule loses because routing logic becomes load-bearing infrastructure. Requires constant tuning. Breaks when your data distribution shifts.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@anthropic-ai/sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;classifyWithFallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;confidenceThreshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="c1"&gt;// First pass: try Haiku (cheap, fast)&lt;/span&gt;
 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;haikuResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-3-5-haiku-20241022&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Classify this text as: safe, unsafe, or review-needed. Return JSON with {classification, confidence}.

Text: "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"`&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;]&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;

 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;haikuResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;haikuResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

 &lt;span class="c1"&gt;// Log all Haiku decisions (even successes)&lt;/span&gt;
 &lt;span class="c1"&gt;// You're building a dataset of "when does Haiku work?"&lt;/span&gt;
 &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;haiku&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;haikuResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;haikuResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;tokensUsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="nx"&gt;haikuResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
 &lt;span class="nx"&gt;haikuResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output_tokens&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;

 &lt;span class="c1"&gt;// If Haiku is unsure, escalate to Sonnet&lt;/span&gt;
 &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;haikuResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;confidenceThreshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sonnetResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-3-5-sonnet-20241022&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Classify this text as: safe, unsafe, or review-needed. Return JSON with {classification, confidence}.

Text: "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"`&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;]&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;

 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sonnetResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sonnetResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sonnet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;escalatedFrom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;haiku&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;sonnetResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;tokensUsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="nx"&gt;sonnetResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
 &lt;span class="nx"&gt;sonnetResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output_tokens&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;

 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;sonnetResult&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;haikuResult&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Run both models in parallel during development and log the results. In production, start with Haiku, escalate on low confidence. As your logs accumulate, you'll see exactly which tasks need expensive models and which don't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Math
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Haiku: $0.80 per 1M input tokens
Sonnet: $3 per 1M input tokens

Scenario: 1M requests/month, 200 tokens average
- All Sonnet: 1M × 200 tokens = $600
- 95% Haiku: (950k × 200) Haiku + (50k × 200) Sonnet = $152 + $30 = $182
- Savings: $418/month

At enterprise scale (100M requests/month): $41,800/month saved by routing to the cheapest viable model.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cost difference compounds. Small routing decisions get multiplied across thousands of requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Common Pitfall
&lt;/h2&gt;

&lt;p&gt;You'll build a sophisticated router and wonder why it doesn't move the needle. Usually because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You spent three months on routing logic but only one week validating it&lt;/li&gt;
&lt;li&gt;The escalation threshold is too aggressive ("if anything looks hard, use Sonnet")&lt;/li&gt;
&lt;li&gt;You're routing on heuristics, not observed behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix: measure first, always. Log both models' responses in parallel before committing to either one. You'll find that the obvious cases are really obvious, and the edge cases are smaller than you think.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Routing Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Build it if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have &amp;gt;100k requests/month (smaller volume doesn't justify overhead)&lt;/li&gt;
&lt;li&gt;Your requests fall into clusters (some are cheap tasks, some are hard)&lt;/li&gt;
&lt;li&gt;You can measure ground truth (compare Haiku vs Sonnet, track which was right)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Don't build it if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;lt;10k requests/month (infrastructure overhead isn't worth it)&lt;/li&gt;
&lt;li&gt;Every request is unique and complex (no pattern to exploit)&lt;/li&gt;
&lt;li&gt;You need 99.9% accuracy (can't tolerate Haiku failures)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Real Win
&lt;/h2&gt;

&lt;p&gt;The cost savings are real. But the bigger win is the audit itself. Building a router forces you to measure exactly where your complexity actually lives. Most teams overthink what they need because they never measure. The router is just the excuse to finally look.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>3 Things I Learned Auditing My LLM App's Token Spend (And Why Your Benchmarks Are Lying)</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Mon, 27 Apr 2026 08:04:50 +0000</pubDate>
      <link>https://dev.to/natevoss/3-things-i-learned-auditing-my-llm-apps-token-spend-and-why-your-benchmarks-are-lying-3nbi</link>
      <guid>https://dev.to/natevoss/3-things-i-learned-auditing-my-llm-apps-token-spend-and-why-your-benchmarks-are-lying-3nbi</guid>
      <description>&lt;p&gt;You know that feeling when you ship an AI feature and realize your token bill is 3x what you estimated? Yeah, that was me last week.&lt;/p&gt;

&lt;p&gt;I have this thing called Agent-Max — it's a multi-platform growth agent that runs autonomous workflows: generating content, publishing to Bluesky, Medium, Twitter, Reddit. Sounds heavy, right? Every Monday it synthesizes a week of reading, scrapes engagement metrics, decides what to post and where. Seven platforms. Infinite LLM calls if you're not paying attention.&lt;/p&gt;

&lt;p&gt;Last Sunday I realized I had no idea what I was actually spending. I knew &lt;em&gt;roughly&lt;/em&gt; — "somewhere between $5-20/week" — but roughly is how you end up with bill shock. So I built PromptFuel to solve the actual problem: measure what your app is doing, not what the docs say it &lt;em&gt;should&lt;/em&gt; do.&lt;/p&gt;

&lt;p&gt;Here's what three days of auditing my own code taught me.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Your bottleneck isn't the model you picked, it's the prompt you didn't trim
&lt;/h2&gt;

&lt;p&gt;I assumed my biggest cost sink was the weekly reflection. Claude reads 7 days of snapshots, engagement data, content history, trend analysis, then reasons about next week's strategy. Heavy prompt, right?&lt;/p&gt;

&lt;p&gt;Nope.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;pf optimize&lt;/code&gt; on the actual prompts showed the reflection was 2,847 tokens. Not small, but fine. The real killer: the daily content pregeneration loop was calling Claude 5 times per platform, and each call had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Entire engagement history (redundant. I'm fetching fresh data every run)&lt;/li&gt;
&lt;li&gt;Every. Single. Previous. Post. (all 120 of them, in the context)&lt;/li&gt;
&lt;li&gt;Current date, weather, trending topics (reloaded every call)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cutting history to "last 10 posts, last 3 days of engagement" knocked 40% off. Not because I switched models. Because I stopped hallucinating I needed context I wasn't even &lt;em&gt;reading&lt;/em&gt;.&lt;/p&gt;
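
&lt;p&gt;The change itself was unglamorous. The shape of it, with placeholder names rather than the real ones:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Trim prompt context to what the prompt actually reads.
// (Field names are placeholders; the real project's differ.)
function buildPromptContext(allPosts, engagementByDay) {
  return {
    posts: allPosts.slice(-10),            // last 10 posts, not all 120
    engagement: engagementByDay.slice(-3), // last 3 days, not the full history
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
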

&lt;h2&gt;
  
  
  2. Your audit will surface the dumb stuff, not the obvious stuff
&lt;/h2&gt;

&lt;p&gt;Benchmarks tell you Sonnet costs $3 per 1M input tokens. Haiku costs $0.80. Pick the right model, do the math, move on.&lt;/p&gt;

&lt;p&gt;Except I was calling Claude Sonnet 7 times/week on background analytics where Haiku was plenty. Not intentional. I'd copied the model from an earlier prompt and never thought about it again. One-line change, zero quality loss, $2 saved per month.&lt;/p&gt;

&lt;p&gt;That math &lt;em&gt;never&lt;/em&gt; shows up in a benchmark. It shows up in your actual codebase, on your actual data, running your actual job. PromptFuel's advantage isn't telling you models are expensive. It's finding the calls you forgot about and showing you the before/after side-by-side.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Once you see the numbers, the optimization loop becomes obvious
&lt;/h2&gt;

&lt;p&gt;The first time I ran the dashboard, I thought I was done. Then Monday's weekly job ran and I watched 47 new prompts execute. Dashboard updated in real time. I saw the pattern. There's another cut.&lt;/p&gt;

&lt;p&gt;Auditing once is useful. Auditing every week is how you stop bleeding money.&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's walk through it
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run pf optimize&lt;/strong&gt; on a real prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf optimize ./src/prompts/reflect.md &lt;span class="nt"&gt;--model&lt;/span&gt; claude-3-5-sonnet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see token count, cost per call, and a readability score. More importantly, you'll see where the redundancy is hiding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open the dashboard&lt;/strong&gt; to watch prompts in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf dashboard &lt;span class="nt"&gt;--watch&lt;/span&gt; ./src/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dashboard opens on port 3000. Every time you call an LLM, it logs the model, input tokens, output tokens, cost, and latency. No guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For production, wire up the SDK:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PromptFuel&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;promptfuel/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@anthropic-ai/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PromptFuel&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;pf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrapClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-3-5-sonnet-20241022&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;your prompt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Automatically tracked. One line changes nothing&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getMetrics&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; 
&lt;span class="c1"&gt;// { totalTokens: 342, totalCost: $0.008, calls: 1 }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real numbers
&lt;/h2&gt;

&lt;p&gt;Agent-Max before: ~1,847 tokens/week across all platforms.&lt;/p&gt;

&lt;p&gt;Agent-Max after (trimmed + downgraded safe calls to Haiku): 1,094 tokens/week.&lt;/p&gt;

&lt;p&gt;40% reduction. No quality loss. Three hours to audit and implement.&lt;/p&gt;

&lt;p&gt;That's not a benchmark. That's a real app, real prompts, real data.&lt;/p&gt;




&lt;p&gt;Stop guessing about your token spend. Measure what you're actually doing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://promptfuel.vercel.app?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=max" rel="noopener noreferrer"&gt;https://promptfuel.vercel.app?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=max&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How I Accidentally Spent $800/Month on LLM Tokens I Didn't Need (And How to Fix It)</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:46:42 +0000</pubDate>
      <link>https://dev.to/natevoss/how-i-accidentally-spent-800month-on-llm-tokens-i-didnt-need-and-how-to-fix-it-oi7</link>
      <guid>https://dev.to/natevoss/how-i-accidentally-spent-800month-on-llm-tokens-i-didnt-need-and-how-to-fix-it-oi7</guid>
      <description>&lt;p&gt;I spent six weeks shipping the wrong thing.&lt;/p&gt;

&lt;p&gt;I built PromptFuel because I was hemorrhaging money on API calls. Not because I was building at scale—I wasn't. I was building &lt;em&gt;dumb&lt;/em&gt;. I'd write a prompt in isolation, test it once, ship it, then wonder why my OpenAI bill jumped $200. Turns out I was doing things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Asking GPT-4 to write validation logic that Haiku could handle just fine&lt;/li&gt;
&lt;li&gt;Sending full context windows when 30% of it was redundant&lt;/li&gt;
&lt;li&gt;Retrying identical requests with slightly different temperatures instead of picking one and sticking with it&lt;/li&gt;
&lt;li&gt;Including examples in prompts that the model was already trained on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real kicker? None of this was visible. I had no idea which requests were wasteful, which models were overkill for my tasks, or where I was throwing money away. I just had a credit card statement and regret.&lt;/p&gt;

&lt;p&gt;So I built a tool to see what I was actually doing. And then I optimized it. Here's how.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Choosing the right model for a job isn't about capabilities. Haiku can validate JSON, classify text, and format output just as well as GPT-4o for most real work. The difference is cost: Haiku runs roughly 10x cheaper per token.&lt;/p&gt;

&lt;p&gt;But without visibility, you default to the expensive one. Because it's safe. Because you can't see the waste.&lt;/p&gt;

&lt;p&gt;After I started measuring, I found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;35% of my requests didn't need GPT-4o.&lt;/strong&gt; They were hitting it because it was the default, not because it was the right tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20% of my prompts had bloat.&lt;/strong&gt; Instructions that contradicted each other, examples I copy-pasted but never used, context I included "just in case."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15% of requests were duplicates.&lt;/strong&gt; Same input, same model, within minutes. If I'd cached or batched them, I'd have cut that spend in half.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total: &lt;strong&gt;40% waste.&lt;/strong&gt; $800 → $480. Not revolutionary, but real money for an indie project.&lt;/p&gt;

&lt;p&gt;The fix wasn't rocket science. It was boring infrastructure: measure, analyze, optimize, repeat.&lt;/p&gt;
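&lt;p&gt;Of those three buckets, the duplicate requests were the cheapest to fix, and they don't even need a tool. A naive in-memory cache keyed on model plus prompt is enough to stop paying twice for the same call; this is a sketch only, and a real app would want TTLs and persistence:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Naive request cache: an identical model + prompt pair hits the API once per process.
const cache = new Map();

async function cachedCompletion(client, model, prompt) {
  const key = model + '::' + prompt;
  if (cache.has(key)) {
    return cache.get(key); // duplicate call, zero tokens spent
  }
  const response = await client.messages.create({
    model: model,
    max_tokens: 1024,
    messages: [{ role: 'user', content: prompt }],
  });
  cache.set(key, response);
  return response;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;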

&lt;h2&gt;
  
  
  Step 1: See What You're Actually Doing
&lt;/h2&gt;

&lt;p&gt;Install PromptFuel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No API keys, no auth, no bullshit. The tool runs locally.&lt;/p&gt;

&lt;p&gt;Now run this on any prompt or code snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf optimize &lt;span class="nt"&gt;--input&lt;/span&gt; &lt;span class="s2"&gt;"Your prompt here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or point it at a file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf optimize &lt;span class="nt"&gt;--file&lt;/span&gt; my-prompt.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token count&lt;/strong&gt; — exactly what you'll be charged for&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost estimates&lt;/strong&gt; — broken down by model (Haiku, Sonnet, GPT-4o, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimization suggestions&lt;/strong&gt; — what you can trim without losing meaning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model recommendations&lt;/strong&gt; — which model actually makes sense for this task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Current prompt: 412 tokens

Optimization suggestions:
  - Remove redundant instruction (line 8)
  - Simplify JSON schema example (saves 34 tokens)
  - Collapse repeated context (saves 18 tokens)

Cost per call:
  - GPT-4o: $0.006 (❌ overpowered)
  - Claude 3.5 Sonnet: $0.002 (✓ recommended)
  - Claude 3 Haiku: $0.0004 (✓ if you only need classification)

Estimated monthly (1000 calls):
  - Current setup: $6.12
  - Optimized: $1.84
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the insight. That's what I was missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Understand Your Actual Costs
&lt;/h2&gt;

&lt;p&gt;Open the dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your default browser opens to a local dashboard showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All your recent prompts&lt;/strong&gt; and their token counts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost distribution&lt;/strong&gt; — which requests ate the most budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model usage&lt;/strong&gt; — are you using the expensive ones too much?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimization opportunities&lt;/strong&gt; — ranked by potential savings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dashboard doesn't need your API keys. It's analyzing local data. But it &lt;em&gt;will&lt;/em&gt; tell you which of your shipped prompts are costing way more than they should.&lt;/p&gt;

&lt;p&gt;Spend 10 minutes here. You'll probably find something you didn't realize you were doing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Integrate into Your Stack
&lt;/h2&gt;

&lt;p&gt;Once you see the waste, you'll want to catch it earlier. That's where the SDK and MCP server come in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: JavaScript SDK (for Next.js, Node apps)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @promptfuel/sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PromptOptimizer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@promptfuel/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PromptOptimizer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`You are a helpful assistant...
Classify the following text into categories...
[20 more lines of context you don't actually need]`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`This prompt costs $&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;costPerCall&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gpt4o&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Optimized version: $&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;optimized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;costPerCall&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gpt4o&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Actually use the optimized version&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;optimizedPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;optimized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option B: Claude Code MCP Server (for use in Claude directly)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're like me and you use Claude for a lot of your thinking, add the PromptFuel MCP server to your Claude Code settings. Then ask Claude directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@promptfuel optimize my prompt for cost

[paste your prompt]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude runs it through PromptFuel's analysis and tells you exactly where you're bleeding money. Then it generates an optimized version.&lt;/p&gt;

&lt;p&gt;Both approaches catch waste before it ships.&lt;/p&gt;
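&lt;p&gt;And if you want "before it ships" to be literal, the same &lt;code&gt;analyze()&lt;/code&gt; call can gate CI. This is a sketch on top of the SDK call above; the budget number and the script wiring are my own assumptions, not a PromptFuel feature:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical CI gate: fail the build when a prompt's estimated cost crosses a budget.
import { PromptOptimizer } from '@promptfuel/sdk';
import { readFileSync } from 'node:fs';

const BUDGET_PER_CALL = 0.01; // dollars; arbitrary threshold for this sketch

const optimizer = new PromptOptimizer();
const prompt = readFileSync(process.argv[2], 'utf8');
const analysis = await optimizer.analyze(prompt);

if (Number(analysis.costPerCall.gpt4o) &amp;gt; BUDGET_PER_CALL) {
  console.error('Prompt over budget:', analysis.costPerCall.gpt4o);
  process.exit(1); // build fails, prompt gets trimmed before it ships
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;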

&lt;h2&gt;
  
  
  What Happened Next
&lt;/h2&gt;

&lt;p&gt;After I actually measured and optimized my stuff, here's what I learned:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You don't need the expensive model as often as you think.&lt;/strong&gt; Most of my classification, formatting, and even some reasoning tasks work fine on Haiku.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt bloat is real.&lt;/strong&gt; Every instruction that contradicts another one, every "just in case" example, every "let me explain the context" paragraph adds tokens and confusion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token count scales weird.&lt;/strong&gt; I thought I'd save 10%. I saved 40%. Because once you see the pattern, you fix it everywhere.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For me: $800 → $480/month. For you, it might be different. But it won't be zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started (Right Now)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Install: &lt;code&gt;npm install -g promptfuel&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Optimize a single prompt: &lt;code&gt;pf optimize --file your-prompt.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Open the dashboard: &lt;code&gt;pf dashboard&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If you like it, integrate the SDK or MCP server into your workflow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No commitment. No API keys. No upsell. Just a free tool that shows you where your money's going.&lt;/p&gt;

&lt;p&gt;The tool exists because I was tired of guessing. If you are too, give it a try: &lt;a href="https://promptfuel.vercel.app?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=max" rel="noopener noreferrer"&gt;https://promptfuel.vercel.app?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=max&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>3 Things I Learned Benchmarking Claude, GPT-4o, and Gemini on Real Dev Work</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Tue, 21 Apr 2026 12:07:05 +0000</pubDate>
      <link>https://dev.to/natevoss/3-things-i-learned-benchmarking-claude-gpt-4o-and-gemini-on-real-dev-work-38fl</link>
      <guid>https://dev.to/natevoss/3-things-i-learned-benchmarking-claude-gpt-4o-and-gemini-on-real-dev-work-38fl</guid>
      <description>&lt;p&gt;If you're still picking LLM providers by gut feeling, you're leaving money on the table. I ran 5 developer use cases through Claude 3.5 Sonnet, GPT-4o, and Gemini 2.0 Flash using PromptFuel to measure token usage and cost. The results? More interesting than "fastest wins." Here's what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I took 5 tasks I actually do in PromptFuel development:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;JSON schema validation prompt&lt;/strong&gt; — catch malformed API responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code review feedback&lt;/strong&gt; — multi-file analysis with context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactoring suggestion&lt;/strong&gt; — optimize a chunky utility function&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bug diagnosis&lt;/strong&gt; — trace through a stack trace with logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation generation&lt;/strong&gt; — write API docs from code comments&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each was run through all three models with identical input. I used PromptFuel's CLI to count tokens and calculate costs, because doing this manually is chaos. Output quality was rated by me (subjectively, but honestly).&lt;/p&gt;
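&lt;p&gt;One note on the cost columns below: per-call cost is just token counts multiplied by per-token prices. If you want to sanity-check numbers like these against your own runs, the arithmetic is one function. The prices here are illustrative placeholders, so swap in your provider's current rates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Per-call cost = input tokens * input price + output tokens * output price.
// Prices are in dollars per million tokens and are illustrative, not authoritative.
const PRICES = {
  'claude-3-5-sonnet': { input: 3.0, output: 15.0 },
  'gpt-4o': { input: 5.0, output: 15.0 },
  'gemini-2.0-flash': { input: 0.1, output: 0.4 },
};

function costPerCall(model, inputTokens, outputTokens) {
  const p = PRICES[model];
  return (inputTokens * p.input + outputTokens * p.output) / 1e6;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;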

&lt;h2&gt;
  
  
  Use Case Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. JSON Schema Validation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Schema definition + malformed JSON sample + expected error message format&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage (input → output):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 1,847 → 512 (cost: $0.0043)&lt;/li&gt;
&lt;li&gt;GPT-4o: 2,156 → 487 (cost: $0.0082)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 1,923 → 501 (cost: $0.0001)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; All three nailed it. Claude was most concise in its explanation. GPT-4o over-explained. Gemini was crisp and useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini, by cost. Claude, by clarity per token.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Code Review (3 files, ~200 LOC)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Three TypeScript modules + review instructions + examples of good feedback&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 4,231 → 891 (cost: $0.0147)&lt;/li&gt;
&lt;li&gt;GPT-4o: 4,782 → 856 (cost: $0.0208)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 4,456 → 823 (cost: $0.0003)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; Claude caught subtle issues I actually cared about. GPT-4o was thorough but verbose. Gemini gave surface-level feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini cheapest. Claude best output/token.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Refactoring Suggestion
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; 80-line utility function + performance requirements + current bottleneck description&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 2,134 → 618 (cost: $0.0054)&lt;/li&gt;
&lt;li&gt;GPT-4o: 2,445 → 602 (cost: $0.0110)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 2,287 → 587 (cost: $0.0002)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; Claude's refactor was production-ready. GPT-4o suggested good ideas but with syntax issues. Gemini's suggestion worked but wasn't elegant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini cost, Claude quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Bug Diagnosis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Stack trace (15 lines) + error logs (20 lines) + code snippet (40 lines) + fixes already attempted&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 2,856 → 445 (cost: $0.0071)&lt;/li&gt;
&lt;li&gt;GPT-4o: 3,102 → 421 (cost: $0.0127)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 2,934 → 438 (cost: $0.0002)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; Claude nailed it immediately. GPT-4o circled around the issue. Gemini flagged the right file but not the root cause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini cost, Claude accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Documentation Generation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; 12 functions with JSDoc comments + expected markdown format + examples&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 3,445 → 734 (cost: $0.0118)&lt;/li&gt;
&lt;li&gt;GPT-4o: 3,821 → 689 (cost: $0.0182)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 3,567 → 712 (cost: $0.0004)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; Claude's docs were complete and well-structured. GPT-4o's were good but needed minor cleanup. Gemini's docs were functional but missing details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini cost, Claude completeness.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 3 Things I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Cost-per-task != best value.&lt;/strong&gt; Gemini Flash is comically cheap (~90% less than GPT-4o), but you get what you pay for. When I needed high-stakes work (code review, bug diagnosis), Claude was worth the extra cents because I didn't have to iterate. For throwaway tasks (generating examples, formatting), Gemini's cost made its mediocrity acceptable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Token count is not predictive of quality.&lt;/strong&gt; All three models produced similar token counts for the same input, but output quality varied wildly. GPT-4o consistently used more tokens and wasn't proportionally better. Claude packed useful signal into fewer tokens. This matters: if you're optimizing for cost alone, you'll pick the wrong model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Real-world testing beats benchmarks.&lt;/strong&gt; The model rankings flip depending on what you're actually doing. For documentation, Claude wins. For budget validation of a throwaway check, Gemini wins. Generic "fastest model" articles don't capture this. You need to test your actual tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Benchmark Yours
&lt;/h2&gt;

&lt;p&gt;Here's the thing: this comparison is data, not law. Your tasks might rank the models differently. Let me show you how I tested this using PromptFuel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install PromptFuel (if you haven't)&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel

&lt;span class="c"&gt;# Create a test file with your prompt&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; test-prompt.txt &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
[your prompt here]
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Count tokens across models&lt;/span&gt;
pf count test-prompt.txt &lt;span class="nt"&gt;--model&lt;/span&gt; claude-3-5-sonnet
pf count test-prompt.txt &lt;span class="nt"&gt;--model&lt;/span&gt; gpt-4o
pf count test-prompt.txt &lt;span class="nt"&gt;--model&lt;/span&gt; gemini-2.0-flash

&lt;span class="c"&gt;# Compare costs&lt;/span&gt;
pf count test-prompt.txt &lt;span class="nt"&gt;--compare&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;--compare&lt;/code&gt; flag gives you a cost matrix. Takes 30 seconds. Beats guessing.&lt;/p&gt;

&lt;p&gt;The real insight: &lt;strong&gt;run this for your specific use cases.&lt;/strong&gt; A document summarizer might favor Claude. A high-throughput classification pipeline might favor Gemini. The only way to know is to test.&lt;/p&gt;
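&lt;p&gt;If you also want to compare quality, not just cost, the harness doesn't need to be fancy. A rough sketch of one such loop; &lt;code&gt;runTask&lt;/code&gt; and &lt;code&gt;loadMyRealTasks&lt;/code&gt; stand in for whatever provider calls and prompt storage you already have, and nothing here is PromptFuel-specific:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal benchmark loop: run every real task against every candidate model,
// record tokens, and keep the outputs so you can judge quality side by side.
const MODELS = ['claude-3-5-sonnet', 'gpt-4o', 'gemini-2.0-flash'];

async function benchmark(loadMyRealTasks, runTask) {
  const tasks = await loadMyRealTasks(); // your prompts and inputs, however you store them
  const rows = [];
  for (const task of tasks) {
    for (const model of MODELS) {
      const result = await runTask(model, task); // your existing provider call
      rows.push({
        task: task.name,
        model: model,
        inputTokens: result.usage.inputTokens,
        outputTokens: result.usage.outputTokens,
        output: result.text, // quality gets judged by a human, not a benchmark
      });
    }
  }
  return rows;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;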




&lt;h2&gt;
  
  
  The Real Optimization
&lt;/h2&gt;

&lt;p&gt;After picking your model, there's still money left on the table. Here's a before/after from actual PromptFuel code:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (unoptimized prompt):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an expert code reviewer. Review the following code for quality, security, 
and performance issues. Check for common bugs, suggest improvements, and rate the 
code from 1-10. Consider edge cases, error handling, and best practices. Be thorough 
and detailed in your feedback.

[400 tokens of instructions]
[200 tokens of examples]
[150 tokens of code to review]
Total: ~750 input tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After (optimized with PromptFuel):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Review code for quality, security, performance. Rate 1-10.

[Stripped redundant instructions]
[Examples reduced to 1 exemplar instead of 3]
[Code reformatted to remove whitespace]
Total: ~420 input tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost saved: ~$0.0012 per review on Claude. Run that 100 times a day, and you're saving $0.12/day, roughly $44/year. Small? Yes. Multiplied by 50 internal tools? Now you're talking real money.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Take
&lt;/h2&gt;

&lt;p&gt;Pick the model that gives you the output you need, then optimize the prompt. Stop optimizing for the wrong metric. Benchmarks are fun, but production bills are real.&lt;/p&gt;

&lt;p&gt;If you're running this analysis for your own stuff, PromptFuel makes it stupidly easy. It's free, no API keys needed, runs locally. Just &lt;code&gt;npm install -g promptfuel&lt;/code&gt; and compare. If you want the actual numbers from your prompts, run the test. Don't inherit my data — build your own.&lt;/p&gt;

&lt;p&gt;What's your highest-volume LLM task? Test it. You might be surprised which model wins.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #ai #tutorial #javascript #optimization&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>optimization</category>
    </item>
  </channel>
</rss>
