
70% of Your AI Agent's Conversations Are Predictable. Act Like It.

We analyzed 580+ conversations from a live commerce AI agent and found that 70% follow predictable patterns. Two optimizations — quick responses and dynamic thinking budgets — cut AI costs by ~39% without losing intelligence.

Md. Mehedi Hasan
Founder, Karigor AI Labs

TL;DR#

We run a commerce AI agent on Facebook Messenger — Claude Haiku 4.5 with extended thinking, handling 170+ conversations/day for a skincare brand in Bangladesh. After analyzing 580+ conversations, we discovered:

  • 33% of first messages are Facebook auto-generated CTAs ("Can you check the price?")
  • 20% are price asks ("দাম কত?" — "what's the price?")
  • 17% are location questions, purchase CTAs, and greetings
  • The LLM was spending 2,048 thinking tokens on each of these predictable queries

We shipped two optimizations in one afternoon:

  1. Quick Response Filter — pattern-match common first messages, return pre-built responses, skip the LLM entirely
  2. Dynamic Thinking Budget — 1,024-token thinking for early turns, 2,048 for complex conversations

Result: ~39% cost reduction. Zero measurable quality impact. Deployed same day.


The Discovery#

Our agent was working. Conversations tripled from 60 to 170/day as the client increased Facebook ad spend. Orders grew from 3 to 7/day. But the AI cost scaled linearly with every conversation — even the ones that went nowhere.

We pulled the data:

580+ conversations analyzed (5-day window)

100 conversations (click-to-message ad traffic)
 │
 ├── 37 BOUNCED (37%) — 1 message, never returned
 │   └── 40% are Facebook auto-CTAs (not typed by customer)
 │
 ├── 37 LIGHT (37%) — 2-3 messages, price check & gone
 │   └── Got price, said "ok/hmm", disappeared
 │
 └── 26 ENGAGED (26%) — 4+ messages, real buying interest
      │
      ├── 2 ORDERS (7.7% of engaged)
      ├── 4 HANDOFFS (15% of engaged)
      └── 20 MID-FUNNEL (77% of engaged)

74% of conversations never reached meaningful engagement. They were ad clickers, price comparers, and Facebook's auto-generated icebreakers.

But every single one of them triggered a full LLM call — system prompt loaded, thinking tokens generated, response crafted.

We looked at the most common first messages:

| Message | Count | Source |
| --- | --- | --- |
| "Can you check the price of a product?" | 33 | Facebook auto-CTA |
| "Where are you located?" | 9 | Facebook auto-CTA |
| "দাম কত" / "দাম কতো" ("what's the price?") | 11 | Customer typed |
| "pp" / "price" | 4 | Customer typed |
| "Can I make a purchase?" | 2 | Facebook auto-CTA |
| "hi" / "হাই" | 4 | Customer typed |

63 out of 100 first messages fell into 3 categories: auto-CTAs, price asks, and greetings. The LLM's response to all of them was nearly identical — a price list with a warm greeting.

We were paying Claude to think about a question it answered the same way every time.


The Cost Problem: Thinking Tokens Are the Hidden Tax#

Here's what most teams miss about LLM cost optimization: prompt caching already solved the input problem.

Our setup uses Anthropic's prompt caching with a 1-hour TTL on the system prompt:

typescript
// System prompt cached at 1h TTL — read for $0.10/MTok instead of $1.00/MTok
const promptCachingHandler: ContextHandler = async (_ctx, { allMessages }) => {
  return allMessages.map((msg, idx) => {
    if (msg.role === 'system') {
      return {
        ...msg,
        providerOptions: {
          anthropic: { cacheControl: { type: 'ephemeral', ttl: '1h' } },
        },
      };
    }
    return msg;
  });
};

After the first turn, our ~8,000-token system prompt is read from cache at $0.10/MTok instead of $1.00/MTok. That's a 90% reduction on input tokens. Great.

But here's the cost breakdown for a typical "দাম কত?" (price ask) turn:

| Token Type | Count | Rate ($/MTok) | Cost | % of Total |
| --- | --- | --- | --- | --- |
| System prompt (cache read) | 8,000 | $0.10 | $0.0008 | 3% |
| Conversation (cache read) | 800 | $0.10 | $0.0001 | 0% |
| New message (input) | 200 | $1.00 | $0.0002 | 1% |
| Thinking (output) | 2,048 | $5.00 | $0.0102 | 36% |
| Response text (output) | 300 | $5.00 | $0.0015 | 5% |
| Cache write (amortized) | — | — | $0.0150 | 55% |

On cached turns (turn 2+), the cache write disappears and thinking becomes 78% of total cost.

The insight: Prompt caching made input cheap. Thinking tokens — billed as output at $5/MTok — are now the dominant cost. And unlike input, thinking tokens can't be cached.
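The per-turn numbers above are easy to sanity-check. The sketch below recomputes the split from the pricing rates in the table — `turnCost` is a throwaway helper for this post, not part of our codebase:

```typescript
// Rates from Anthropic's Haiku 4.5 pricing, converted to $/token.
const RATES = {
  cacheRead: 0.10 / 1_000_000,
  input: 1.00 / 1_000_000,
  output: 5.00 / 1_000_000, // thinking and response text both bill as output
};

function turnCost(tokens: {
  cachedInput: number;
  freshInput: number;
  thinking: number;
  response: number;
}) {
  const cost =
    tokens.cachedInput * RATES.cacheRead +
    tokens.freshInput * RATES.input +
    tokens.thinking * RATES.output +
    tokens.response * RATES.output;
  const thinkingShare = (tokens.thinking * RATES.output) / cost;
  return { cost, thinkingShare };
}

// A cached "দাম কত?" turn: ~8,800 cached tokens, 200 fresh, 2,048 thinking, 300 response
const turn = turnCost({ cachedInput: 8_800, freshInput: 200, thinking: 2_048, response: 300 });
```

On a warm cache, thinking dominates the turn cost, consistent with the breakdown above — which is exactly why the rest of this post attacks thinking tokens rather than input tokens.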

Our agent's extended thinking budget was set to 2,048 tokens. For a "দাম কত?" response, the agent's thinking looked like this:

গ্রাহক দাম জানতে চাচ্ছেন। আমাদের জনপ্রিয় প্রোডাক্ট তিনটা —
উপটান ৳650, KT নাইট ক্রিম ৳500, ফেস ওয়াশ ৳150। প্রাইস লিস্ট দিই...

("The customer wants the price. Our three popular products — Ubtan ৳650, KT Night Cream ৳500, Face Wash ৳150. Let me send the price list...")

That's roughly 200-300 thinking tokens used. But the budget ceiling of 2,048 meant Haiku sometimes expanded its reasoning unnecessarily — adding context it didn't need, reconsidering the tone, checking rules it already followed. The budget was headroom being burned.


Layer 1: Quick Response Filter — Skip the LLM Entirely#

For the 55% of first messages that match known patterns, we don't call the LLM at all.

The Architecture#

Customer sends first message
        │
        ▼
  ┌─────────────────────┐
  │ prepareAgentRun()    │  Returns: customerText, agentTurnCount
  └──────────┬──────────┘
             │
             ▼
  ┌─────────────────────┐
  │ Quick Response Check │  agentTurnCount === 0?
  │ matchQuickResponse() │  Pattern match customerText?
  └──────────┬──────────┘
             │
       ┌─────┴─────┐
    MATCHED      NOT MATCHED
       │              │
       ▼              ▼
  Send pre-built   Continue to
  response         LLM pipeline
  Cost: $0.00      (with Layer 2
  Latency: ~100ms   optimization)

The Implementation#

The vendor config declares patterns and responses alongside the system prompt:

typescript
// convex/agents/vendors/thanakaBangladesh.ts

export const thanakaBangladeshConfig: VendorAgentConfig = {
  name: 'Facebook Sales Agent (Thanaka Bangladesh)',
  instructions: `...`, // ~8K token system prompt

  quickResponses: [
    {
      // Facebook auto-generated CTAs (33% of first messages)
      patterns: [
        'can you check the price of a product?',
        'can i make a purchase?',
        'আমি একটি পণ্যের দাম জানতে চাই',
      ],
      response: `আসসালামু আলাইকুম! থানাকা বাংলাদেশে স্বাগতম 🌿

আমাদের জনপ্রিয় প্রোডাক্ট:
উপটান (বড়) ৳650, KT নাইট ক্রিম ৳500, ফেস ওয়াশ ৳150।
আরও প্রোডাক্ট আছে — আর্বুটিন ক্রিম, কম্বো, লোশন। জানতে চাইলে বলুন।
কোনটা নেবেন?`,
    },
    {
      // Direct price asks — Bangla + Banglish (20%)
      patterns: [
        'দাম কত', 'দাম কতো', /^pp$/, /^price$/,
        /^dam koto/, /থানাকা.*দাম/, /thanaka price/i,
      ],
      response: `আমাদের জনপ্রিয় প্রোডাক্ট:
উপটান (বড়) ৳650, KT নাইট ক্রিম ৳500, ফেস ওয়াশ ৳150।
আরও প্রোডাক্ট আছে — আর্বুটিন ক্রিম, কম্বো, লোশন। জানতে চাইলে বলুন।
কোনটা নেবেন?`,
    },
  ],
};

The matcher runs in the dispatch action, before the LLM is instantiated:

typescript
// convex/agents/facebookVendorAgentBase.ts

export function matchQuickResponse(
  config: VendorAgentConfig,
  message: string,
  agentTurnCount: number,
): string | null {
  // Only match on first interaction — never intercept mid-conversation
  if (agentTurnCount > 0 || !config.quickResponses) return null;

  const normalized = message.trim().toLowerCase();
  for (const qr of config.quickResponses) {
    for (const pattern of qr.patterns) {
      if (pattern instanceof RegExp) {
        if (pattern.test(normalized)) return qr.response;
      } else {
        if (normalized === pattern.toLowerCase()) return qr.response;
      }
    }
  }
  return null; // No match — fall through to LLM
}

The key constraint: agentTurnCount > 0 bypasses the filter entirely. Quick responses only fire on the very first agent turn in a conversation. Once the customer sends a follow-up, the full LLM takes over.

What Happens on Match#

typescript
// convex/facebookAgentDispatch.ts

const quickReply = matchQuickResponse(vendorConfig, customerText, agentTurnCount);
if (quickReply) {
  // Log with [QuickResponse] tag for analytics
  await ctx.runMutation(internal.facebookMessages.logAgentMessage, {
    conversationId: args.conversationId,
    pageId: args.pageId,
    content: quickReply,
    thinking: `[QuickResponse] Matched: "${customerText.slice(0, 80)}"`,
  });

  // Send directly — no LLM, no thinking, no tool calls
  await ctx.runAction(internal.facebookSend.sendMessage, {
    pageId: args.pageId,
    psid: args.psid,
    text: quickReply,
  });

  // Release conversation lock
  await ctx.runMutation(internal.facebookConversations.finishAgentRun, {
    conversationId: args.conversationId,
    pageId: args.pageId,
  });
  return; // Done — $0.00 AI cost
}

No API key read. No model instantiation. No streaming. Just a regex match, a database write, and a Messenger API call.


Layer 2: Dynamic Thinking Budget — Less Reasoning for Simple Questions#

For the 45% of first messages that don't match quick responses — and for all subsequent turns — we still call the LLM. But we scale the thinking budget based on conversation depth.

The Insight#

Not all turns need the same reasoning power:

| Turn | Typical Task | Thinking Needed |
| --- | --- | --- |
| 0–2 | Price list, greeting, simple product info | Low — mechanical response |
| 3–5 | Objection handling, product comparison | Medium — needs reasoning |
| 5+ | Order flow, address validation, payment | High — multi-step logic |

Our original setup used a flat 2,048-token thinking budget for every turn. The agent used ~200-400 tokens of thinking for price responses but had headroom to wander.

Before#

typescript
// Static budget — same for every turn
function createHaikuModel(apiKey: string) {
  return wrapLanguageModel({
    model: provider('claude-haiku-4-5-20251001'),
    middleware: defaultSettingsMiddleware({
      settings: {
        providerOptions: {
          anthropic: { thinking: { type: 'enabled', budgetTokens: 2048 } },
        },
      },
    }),
  });
}

After#

typescript
// Dynamic budget — scales with conversation depth
function createHaikuModel(apiKey: string, agentTurnCount: number = 0) {
  const budgetTokens = agentTurnCount <= 2 ? 1024 : 2048;

  return wrapLanguageModel({
    model: provider('claude-haiku-4-5-20251001'),
    middleware: defaultSettingsMiddleware({
      settings: {
        providerOptions: {
          anthropic: { thinking: { type: 'enabled', budgetTokens } },
        },
      },
    }),
  });
}

How Turn Count Flows Through the System#

The turn count comes from prepareAgentRun — the mutation that locks the conversation and builds the prompt. It already queries recent messages, so counting agent turns is free:

typescript
// convex/facebookConversations.ts — inside prepareAgentRun

const recentMessages = await ctx.db
  .query('fbMessages')
  .withIndex('by_conversationId', (q) => q.eq('conversationId', args.conversationId))
  .order('desc')
  .take(50);

// Count prior agent turns — zero extra DB queries
const agentTurnCount = recentMessages.filter((m) => m.role === 'agent').length;

return {
  messageId,
  threadId: conv.threadId,
  agentTurnCount,  // 0 = first turn, 1 = second, etc.
  customerText: combinedText,
  // ...
};

The dispatch action passes it to the model factory:

typescript
// convex/facebookAgentDispatch.ts

const agent = vendorConfig
  ? createVendorAgent(apiKey, vendorConfig, agentTurnCount)
  : createFacebookSalesAgentHaiku(apiKey);

One important design decision: staff handbacks always get the full 2,048 budget. When a human agent hands a conversation back to the AI, the context is complex — the customer was already escalated. We default agentTurnCount to 999 for the direct (non-batched) path.
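In code, the handback rule is tiny. A minimal sketch (the helper name is ours; the sentinel value is the one from our direct path):

```typescript
// Same tiering as createHaikuModel: early turns get the reduced budget.
function thinkingBudget(agentTurnCount: number): number {
  return agentTurnCount <= 2 ? 1024 : 2048;
}

// Staff handbacks pass a sentinel turn count so the tier check
// always lands on the full budget, regardless of actual history.
const HANDBACK_TURN_COUNT = 999;
const handbackBudget = thinkingBudget(HANDBACK_TURN_COUNT); // always 2048
```

A sentinel is cruder than a dedicated `isHandback` flag, but it keeps the model factory's signature unchanged.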


The Math#

Per-Turn Cost Comparison#

| Scenario | Input Cost | Thinking Cost | Output Cost | Total |
| --- | --- | --- | --- | --- |
| Before (any turn) | $0.017 | $0.0102 | $0.0015 | $0.028 |
| Quick response (matched) | $0.000 | $0.000 | $0.000 | $0.000 |
| Dynamic budget (turn 0–2) | $0.017 | $0.0051 | $0.0015 | $0.024 |
| Dynamic budget (turn 3+) | $0.001 | $0.0102 | $0.0015 | $0.013 |

Daily Cost at 170 Conversations/Day#

Before:

170 conversations × ~3 turns avg × $0.020/turn (blended) = $10.20/day
(First turn ~$0.028 with cache write, subsequent turns ~$0.013 cache read)

After:

Layer 1: 55% of first messages matched → 94 conversations start free
  94 × $0.000 (turn 1) + 94 × 2 more turns × $0.013  = $2.44

Layer 2: 45% not matched → low thinking on early turns
  76 × $0.024 (turn 1) + 76 × 2 more turns × $0.013  = $3.80

TOTAL: $6.24/day

Savings: $3.96/day → 39% reduction

At scale, this compounds. A brand doing 500 conversations/day would save ~$11.50/day or ~$345/month — significant for a Bangladeshi SMB where the product costs ৳650 ($6).
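The daily math above can be replayed in a few lines, using the per-turn costs from the comparison table and the 55%/45% match split from our data:

```typescript
const CONVERSATIONS = 170;
const AVG_TURNS = 3;

// Before: every turn hits the LLM at a blended ~$0.020/turn.
const before = CONVERSATIONS * AVG_TURNS * 0.020;

const matched = Math.round(CONVERSATIONS * 0.55); // 94 quick-response conversations
const unmatched = CONVERSATIONS - matched;        // 76 go to the LLM

const after =
  matched * 2 * 0.013 +            // Layer 1: free first turn + 2 cached turns
  unmatched * (0.024 + 2 * 0.013); // Layer 2: reduced-thinking first turn + 2 cached turns

const savings = (before - after) / before; // ≈ 0.39
```

The same script, re-run with your own volumes and match rate, tells you whether the pattern list is worth building for your agent.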


What We Didn't Do (And Why)#

Disable thinking entirely for early turns#

We considered setting thinking: { type: 'disabled' } for turns 0-2 instead of just reducing the budget. This would save 100% of thinking cost on those turns.

We rejected it because even with a 1,024-token budget, the agent occasionally catches edge cases that a zero-thinking model would miss — like recognizing a repeat customer's name in the first message, or detecting that "দাম কত ভাইয়া" (informal "bro") needs a different tone than "আপনাদের প্রোডাক্টের মূল্য জানতে চাই" (formal price inquiry).

1,024 tokens is enough for these micro-judgments. Zero is not.

Use a cheaper model for early turns#

We could route turns 0-2 to a smaller model (or even a rule-based system) and switch to Haiku for complex turns. We didn't pursue this because:

  1. Model switching mid-conversation creates inconsistent personality
  2. The conversation history format differs between model providers

Cache thinking tokens#

Thinking tokens are output tokens — they can't be cached in Anthropic's prompt caching system. Cache applies to input (system prompt, conversation history), not output. This is a fundamental constraint of the architecture.

When to Use This Pattern#

This optimization works when your agent has:

  1. High-volume, low-variance first messages — click-to-message ads, chatbot widgets, and customer support all exhibit this pattern
  2. Prompt caching already enabled — if you haven't set up caching, do that first — it's a bigger win
  3. Extended thinking enabled — if you're not using thinking tokens, Layer 2 doesn't apply (but Layer 1 always does)
  4. Measurable conversation data — you need to know your actual first-message distribution before building the pattern list

If you're running an LLM agent in production and haven't analyzed your first-message distribution, you're almost certainly overspending. The patterns are more predictable than you think.
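That analysis step doesn't need tooling. A minimal sketch, assuming you can export conversations as records with a `firstMessage` field (the type and function names here are ours, purely illustrative):

```typescript
type ConversationRecord = { firstMessage: string };

// Count normalized first messages, most common first.
// Normalization mirrors the quick-response matcher: trim + lowercase.
function firstMessageDistribution(convos: ConversationRecord[]): [string, number][] {
  const counts = new Map<string, number>();
  for (const c of convos) {
    const key = c.firstMessage.trim().toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

The top handful of entries is your quick-response candidate list; anything appearing once is noise the LLM should keep handling.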


FAQ#

Q: Won't the quick response feel robotic compared to the LLM?

We verified this before shipping. We compared the LLM's actual responses to the pre-built ones across 33 "Can you check the price?" conversations. They were nearly identical — the LLM had already converged on this exact format through the system prompt's price-first rules. The customer experience is indistinguishable.

Q: What if product prices change?

The quick responses live in the vendor config alongside the system prompt. When prices change, you update both in the same commit. One file, one deploy.

Q: Does the 1,024-token budget ever cut off important reasoning?

In our testing across 100+ early-turn conversations, the agent used 200-400 thinking tokens for price and greeting responses. The 1,024 ceiling provides comfortable headroom. We've never observed a truncation on turns 0-2.

Q: Can I use this with models other than Claude?

Layer 1 (quick responses) works with any model — it bypasses the LLM entirely. Layer 2 (dynamic thinking) is specific to Anthropic's extended thinking feature. For OpenAI's reasoning models, you'd adjust the reasoning_effort parameter similarly.
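For reasoning-effort models, the Layer 2 idea translates to mapping conversation depth onto an effort level instead of a token budget. A hypothetical sketch — the function is ours, and the three tiers mirror the turn table earlier in this post:

```typescript
type ReasoningEffort = 'low' | 'medium' | 'high';

// Same shape as the dynamic thinking budget: cheap reasoning for
// mechanical early turns, full reasoning once the conversation deepens.
function effortForTurn(agentTurnCount: number): ReasoningEffort {
  if (agentTurnCount <= 2) return 'low';    // price lists, greetings
  if (agentTurnCount <= 5) return 'medium'; // objections, comparisons
  return 'high';                            // order flow, validation
}
```

You'd pass the result as the request's reasoning-effort parameter; the turn-counting plumbing is identical to what we showed above.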


Sources#

Anthropic Documentation

Pricing (as of March 2026)

  • Claude Haiku 4.5: $1/MTok input, $5/MTok output, $0.10/MTok cache read, $2/MTok 1h cache write

Internal Data

  • 580+ conversations analyzed across 5-day window (March 26-30, 2026)
  • 100-conversation deep inspection sample for funnel analysis
  • All claims verified via production queries

Tags: ai-optimization, cost-reduction, prompt-caching, facebook-commerce, extended-thinking, production-patterns
