LLM API Cost Optimization: 7 Techniques to Cut Your Bill by 60%

A practical guide to LLM API pricing, plus proven techniques like prompt compression, model selection, caching, and batch processing to dramatically reduce your API spend — with code examples.

NixAPI Team February 3, 2025 ~4 min read

LLM APIs charge by the token: roughly 3/4 of an English word (or ~1.5 Chinese characters) per token. Many teams see their AI API bills explode after launch, but most of the waste is avoidable. Here are 7 proven techniques to bring costs down.


Understanding the Bill: Input + Output Tokens

Total cost = Input tokens × input price + Output tokens × output price
  • Input: Everything you send to the model (system prompt + conversation history + current message)
  • Output: The model’s response

Output tokens typically cost 3–4× more than input tokens, so limiting output length has the highest ROI.
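To make this concrete, here's a quick back-of-the-envelope calculation in Python (the per-token prices below are illustrative assumptions, not a quote; check your provider's current price sheet):

# Illustrative prices for a small model, assuming the 4x output/input ratio
INPUT_PRICE = 0.15 / 1_000_000   # $ per input token (assumed)
OUTPUT_PRICE = 0.60 / 1_000_000  # $ per output token (assumed)

def request_cost(input_tokens, output_tokens):
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# e.g. 1,200 input + 300 output tokens per request, 100,000 requests/day
print(f"${request_cost(1_200, 300) * 100_000:,.2f} per day")  # → $36.00 per day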


Tip 1: Choose the Right Model — Don’t Use a Sledgehammer for a Nail

The simplest way to save money: don’t use the most expensive model for simple tasks.

Task type                            Recommended model             Relative cost
Classification, keyword extraction   GPT-4o mini / Claude Haiku    1× (baseline)
General Q&A, summarization           GPT-4o mini                   ~5×
Complex reasoning, code generation   GPT-4o / Claude 3.5 Sonnet    ~15×
Most complex tasks                   o1 / Claude Opus              ~50×

Many use cases work perfectly with GPT-4o mini, which costs less than a tenth as much as GPT-4o.
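A simple way to put this into practice is a routing table that picks the cheapest adequate model per task type. A minimal sketch (the task categories, model choices, and the complete() helper are illustrative assumptions):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical routing table: map each task type to the cheapest adequate model
MODEL_BY_TASK = {
    "classification": "gpt-4o-mini",
    "summarization": "gpt-4o-mini",
    "code_generation": "gpt-4o",
}

def complete(task_type, messages, **kwargs):
    model = MODEL_BY_TASK.get(task_type, "gpt-4o-mini")  # cheap default
    return client.chat.completions.create(model=model, messages=messages, **kwargs)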


Tip 2: Compress Your System Prompt

System prompts consume input tokens on every single request. Trim them ruthlessly:

# ❌ Verbose (~80 tokens)
system = """
You are a highly professional customer service assistant. Your responsibility
is to answer user questions about our products. Please ensure your responses
are accurate, helpful, and maintain a friendly tone. If you don't know the
answer, say so directly — never fabricate information.
"""

# ✅ Concise (~20 tokens)
system = "Customer support. Answer product questions accurately. Say 'I don't know' when unsure."

Save 60 tokens per request × 100,000 daily requests = 6 million tokens per day, or 180 million tokens per month.
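To verify what your trimming actually saves, count tokens locally. A sketch using tiktoken, assuming an OpenAI-family model (o200k_base is the tokenizer used by the GPT-4o family), applied to the system string from the example above:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o family tokenizer
print(len(enc.encode(system)))  # exact token count of your trimmed prompt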


Tip 3: Limit Output Length

Tell the model explicitly to keep answers short:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
    max_tokens=200,   # ← hard cap
)

Also reinforce this in the prompt: “Answer in under 100 words.” Using both together works best.
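Putting both caps together (a sketch; question stands in for the user's input):

messages = [
    {"role": "system", "content": "Answer in under 100 words."},  # soft cap in the prompt
    {"role": "user", "content": question},
]
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    max_tokens=200,  # hard cap in the API call
)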


Tip 4: Trim Conversation History

In multi-turn conversations, the full history is re-sent with every request, so cumulative token usage grows quadratically with the number of turns:

def trim_history(messages, max_tokens=3000):
    """Keep the most recent N tokens of history; always retain the system prompt."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    # Rough estimate: 4 chars ≈ 1 token
    total = sum(len(m["content"]) // 4 for m in others)

    while total > max_tokens and len(others) > 1:
        removed = others.pop(0)  # drop oldest message
        total -= len(removed["content"]) // 4

    return system + others
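Call it right before every request so the payload stays within budget:

messages = trim_history(messages, max_tokens=3000)
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)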

Tip 5: Cache Repeated Requests

Don’t call the API for identical requests:

import hashlib, json

_cache = {}

def cache_key(model, messages, **kwargs):
    # Include kwargs (max_tokens, temperature, etc.) so calls with
    # different settings don't collide on the same cache entry
    payload = json.dumps({"model": model, "messages": messages, **kwargs}, sort_keys=True)
    return hashlib.md5(payload.encode()).hexdigest()

def cached_completion(model, messages, **kwargs):
    key = cache_key(model, messages, **kwargs)
    if key in _cache:
        return _cache[key]  # free — no tokens consumed

    result = client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )
    _cache[key] = result
    return result

For FAQ-style workloads, cache hit rates of 60%+ are common. Use Redis in production.
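As a sketch of that production setup, here is a Redis-backed variant of the cache above (assumes redis-py, a running Redis instance, and the cache_key helper defined earlier; responses are stored as JSON with a TTL so stale entries eventually expire):

import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

def cached_completion_redis(model, messages, ttl=86400, **kwargs):
    key = cache_key(model, messages, **kwargs)
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: no tokens consumed

    result = client.chat.completions.create(model=model, messages=messages, **kwargs)
    data = json.loads(result.model_dump_json())  # serialize the SDK response
    r.setex(key, ttl, json.dumps(data))  # expire after ttl seconds
    return data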


Tip 6: Batch API for Non-realtime Tasks

For offline workloads (data analysis, bulk translation, etc.), the Batch API costs half the price of the real-time API:

import json

# Prepare batch tasks
tasks = [
    {"custom_id": f"task-{i}", "method": "POST", "url": "/v1/chat/completions",
     "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": text}]}}
    for i, text in enumerate(texts_to_process)
]

# Write tasks to a JSONL file and upload it
with open("batch_input.jsonl", "w") as f:
    for task in tasks:
        f.write(json.dumps(task) + "\n")
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")

# Submit batch (completes within 24h at a 50% discount)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
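Once the batch finishes, results come back as a JSONL file. A minimal retrieval sketch using the standard Batch API calls:

# Later: poll until the batch finishes, then download the results
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    output = client.files.content(batch.output_file_id)
    for line in output.text.splitlines():
        result = json.loads(line)
        print(result["custom_id"],
              result["response"]["body"]["choices"][0]["message"]["content"])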

Tip 7: Monitor Token Usage — Find the Real Culprits

Know where your money is going before you optimize:

# Log usage after every request
usage = response.usage
print(f"This call: input={usage.prompt_tokens}, output={usage.completion_tokens}")

# Write to a database and aggregate by feature
log_usage(
    feature="chat",
    model=model,
    input_tokens=usage.prompt_tokens,
    output_tokens=usage.completion_tokens,
)

Typically 20% of features consume 80% of tokens. Focus your optimization effort there.
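The log_usage helper above is left abstract; a minimal SQLite-backed version might look like this (the table schema is an assumption), including the aggregation that surfaces the heaviest features:

import sqlite3

db = sqlite3.connect("usage.db")
db.execute("""CREATE TABLE IF NOT EXISTS usage (
    ts TEXT DEFAULT CURRENT_TIMESTAMP,
    feature TEXT, model TEXT, input_tokens INTEGER, output_tokens INTEGER)""")

def log_usage(feature, model, input_tokens, output_tokens):
    db.execute(
        "INSERT INTO usage (feature, model, input_tokens, output_tokens) VALUES (?, ?, ?, ?)",
        (feature, model, input_tokens, output_tokens),
    )
    db.commit()

# Aggregate by feature to find where the tokens actually go
for feature, total in db.execute(
    """SELECT feature, SUM(input_tokens + output_tokens) AS total
       FROM usage GROUP BY feature ORDER BY total DESC"""
):
    print(feature, total)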


Summary

Technique                   Expected savings
Downgrade model             50–90%
Compress system prompt      10–30%
Limit output length         20–50%
Trim conversation history   20–60%
Cache repeated requests     30–60%
Batch API                   50%

Combined, these techniques make an overall cost reduction of 60% very achievable.


👉 Try NixAPI — transparent pricing, no monthly fees, free to start.


Reliable LLM API relay for OpenAI, Claude, Gemini, DeepSeek, Qwen, and Grok with ¥1 = $1 top-up
