LLMs in the Industry: What's Actually Working in 2025

Most teams have run a pilot. The demo worked, the stakeholders were impressed, and someone wrote “productionize this” on a roadmap. Then the project stalled — not because the model stopped performing, but because the operational gap turned out to be wider than anyone expected. The actual challenge with LLMs in production isn’t getting a model to generate something plausible. It’s building the infrastructure around it that makes the output reliable, cost-effective, auditable, and recoverable when things go wrong.

The Gap Between Demo and Production

A demo that works 90% of the time is not a product. In most software, 90% accuracy would mean a broken feature. For LLMs, it often gets called “impressive.” That framing needs to change before you can reason clearly about production readiness.

Production-grade LLM deployment means controlling for things that a prototype never has to face:

p99 latency. GPT-4o and Claude Sonnet can return a response in 2-4 seconds under normal load. At p99, that can spike to 15-20 seconds or timeout entirely during high-traffic periods. If your product requires consistent sub-5s responses, you need caching strategies, streaming, and fallback logic — not just a working API call.

Cost at scale. A system that costs $0.02 per request sounds cheap until it’s handling 100,000 requests per day. That’s $2,000 daily, $60,000 monthly. Token costs are easy to ignore at demo scale and hard to ignore in production budgets.

Model versioning. When Anthropic or OpenAI updates a model, your prompts can behave differently. Outputs that passed evals in June may fail in October. You need prompt versioning, eval regression suites, and a process for validating behavior before you migrate to a new model version.

Auditability. In healthcare, finance, and legal contexts, “the model said so” is not an acceptable audit trail. You need to log inputs, outputs, model versions, and timestamps. You need to be able to reconstruct why a specific output was generated, and you need retention policies that comply with your data obligations.

None of this is unusual — it’s just standard software engineering applied to a component that behaves probabilistically. The teams that scale successfully treat LLMs the way they’d treat any external service with variable behavior: with retries, circuit breakers, evals, and monitoring.

Where LLMs Are Genuinely Useful Today

The use cases where teams are seeing real production ROI share a few common traits: the input is messy and unstructured, the output scope is bounded, and there’s a human in the loop for high-stakes decisions.

Document Intelligence

Processing unstructured documents — contracts, medical records, insurance filings, financial disclosures — was the first category to see serious ROI at scale. The reason is structural: these documents are high-value, the extraction tasks are well-defined, and organizations already had humans doing the work before LLMs existed.

A contract review workflow that extracts termination clauses, auto-renewal dates, and liability caps from a 50-page PDF is doing something concrete. The LLM isn’t making a decision; it’s doing structured extraction. A human reviews the output. The failure mode is a missed clause, which is the same failure mode as a tired paralegal — except the LLM is faster and cheaper per document.

Tools like Unstructured.io, LlamaIndex, and LangChain’s document loaders have made the ingestion pipeline more approachable, but the real work is in prompt engineering for consistent extraction and building the eval harness to catch regressions when the model updates.

Code Generation

GitHub Copilot has over one million paid subscribers. JetBrains AI Assistant, Cursor, and Codeium are all seeing adoption. The productivity case for code generation is well-established in specific contexts.

Where it works well: boilerplate generation, unit test scaffolding, documentation from code, SQL query construction from natural language schema descriptions, and converting between similar data structures. These are tasks where the pattern is clear and the cost of a wrong suggestion is low — a developer reviews it before accepting.

Where it breaks down: complex logic with many cross-file dependencies, security-sensitive code (the model has no concept of your threat model), and anything requiring deep context about architectural decisions made months ago. GitHub’s own research found that Copilot-generated code has higher rates of certain vulnerability patterns than human-written code, particularly around input validation. That’s not a reason to avoid it — it’s a reason to keep security review in your workflow regardless.

The failure modes are predictable, which makes them manageable. Use code generation where it’s strong, and don’t remove the review steps that catch where it’s weak.

Customer Communication Drafting

First-response drafting for customer support is a low-risk, high-value application. The model drafts a reply to an inbound ticket; a human reviews and sends it. The LLM isn’t autonomous — it’s a first draft.

The operational setup is straightforward: pull the ticket, include relevant account context in the prompt, generate a draft, surface it in the agent’s queue for approval. The time savings are real (typical first-response times drop by 40-60% in teams that deploy this well), and the risk profile is low because a human approves every outgoing message.

Where teams overreach is when they try to make this fully autonomous. Removing the human approval step is where you start getting confidently wrong responses sent to customers at 2am.

Clinical Documentation

Ambient AI for clinical note generation is one of the more technically interesting and genuinely useful deployments of LLMs in any regulated industry. Tools like Nuance DAX Copilot, Nabla, and Suki work by listening to a patient encounter and generating a structured clinical note — SOAP format, HPI, assessment, plan — which the physician then reviews and signs.

The model’s job is transcription and structure, not diagnosis. The physician review step is non-negotiable, which makes it viable under existing regulatory frameworks. Physicians spend an estimated 1-2 hours per day on documentation; getting that down meaningfully has real consequences for burnout and patient throughput.

The technical challenge here is latency-at-the-end-of-appointment (the note needs to appear quickly) and PHI handling, which requires covered entity agreements with model providers and often means data never leaves a specific cloud region. Several hospital systems have deployed this at scale with those constraints in place.

Where Things Break Down

Not every use case is a good fit. Some failure patterns are predictable enough to save you a lot of time if you recognize them early.

Prior authorization in healthcare. PA workflows involve ambiguous payer criteria, constantly changing policy documents, and real financial and clinical consequences for wrong outputs. The combination of high ambiguity, frequent policy changes, and serious consequences makes this a poor fit for autonomous LLM decision-making in its current form. Assistive use (helping staff find relevant policy language) is more tractable than autonomous determination.

Real-time decision systems. If your system requires sub-200ms decisions — fraud scoring, content moderation at feed scale, financial risk checks — current LLM latency profiles don’t fit. You can cache common cases and use faster, smaller models for some scenarios, but the general-purpose frontier models are not competitive with purpose-built classifiers on latency.

Autonomous agents without checkpoints. Multi-step agents that can take actions — writing files, sending emails, calling APIs — amplify errors. An early wrong step propagates through the chain. Teams that have succeeded with agents tend to keep the scope narrow, add explicit human approval gates at consequential decision points, and design for graceful failure rather than long uninterrupted chains.

High-stakes outputs with hard evaluation. If you can’t build a reliable eval suite, you can’t know when the system is failing. Some domains — complex legal analysis, novel scientific reasoning, rare clinical presentations — are hard to evaluate automatically and hard to evaluate at scale by human reviewers. If you can’t measure accuracy reliably, you can’t deploy reliably.

The Operational Reality

The token economics of production LLM applications are worth running through explicitly, because they surprise people.

Claude Sonnet 3.5 is priced at $3 per million input tokens and $15 per million output tokens (approximate, as of mid-2025). A moderately complex support ticket with 2,000 input tokens and a 500-token draft response costs roughly $0.014 per request. At 10,000 requests per day, that’s $140/day or about $4,200/month — before infrastructure.

That math changes significantly with prompt caching. Anthropic’s API caches prompt prefixes, which means if you have a long system prompt that’s identical across most requests, you’re charged the cache write cost on first use and a fraction of the read cost on subsequent requests (roughly 10% of the original input cost per cache hit). For applications with large static system prompts — detailed instructions, few-shot examples, policy documents — caching can cut costs by 70-90%.

# Example: using prompt caching with Anthropic's API
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # This gets cached after first request
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": user_query}
    ]
)

Fine-tuning vs. RAG is a question that comes up constantly, and the answer depends on what you’re trying to fix.

Fine-tuning is useful when you need the model to consistently produce a specific format, adopt a specific writing style, or internalize domain-specific terminology that the base model handles poorly. It’s not a good solution for keeping the model up to date with knowledge that changes — you’d need to retrain every time the knowledge changes.

RAG (retrieval-augmented generation) is the right choice when the knowledge you need is too large to put in context, changes frequently, or needs to be auditable back to a source document. Most enterprise knowledge base applications should default to RAG unless they have a specific format or style problem that RAG can’t solve.

# Rough RAG pipeline structure
def answer_with_context(query: str) -> str:
    # Retrieve relevant chunks
    chunks = vector_store.similarity_search(query, k=5)
    
    # Build context string
    context = "\n\n".join([chunk.page_content for chunk in chunks])
    
    # Generate with context
    response = llm.invoke(
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return response

Build vs. Buy

The decision tree is more practical than it might seem.

If your data is relatively standard (English text, common document types), your latency requirements are above 2 seconds, and you need to move fast — use the API. OpenAI, Anthropic, and Google all offer production-grade APIs with uptime SLAs, and you can build a meaningful product without managing any model infrastructure.

If you have large volumes of domain-specific training data, consistent input/output patterns, and the engineering capacity to manage fine-tuning runs and evaluation — fine-tuning on a capable base model can get you better quality at lower cost than repeated prompting of a frontier model.

If you have privacy constraints that prevent sending data to cloud APIs — HIPAA covered entity requirements, financial data residency rules, customer contracts that prohibit third-party processing — you need to look at open-source models you can run on your own infrastructure. Llama 3.1 (Meta), Mistral Large, and Qwen 2.5 are all credible options at the 70B parameter range, performant enough for most document and code tasks, and deployable on-premise or in a private VPC.

The hidden cost in the “build your own” path is always operational: GPU infrastructure, model serving (vLLM is the standard choice for open-source model serving), monitoring, and the engineering time to keep up with model updates. Budget for it honestly before committing.

What to Prove Before You Scale

Before committing engineering resources to scale an LLM feature, verify these things:

You have an eval framework. Before you scale, you need to know when you’re regressing. That means a test set of representative inputs with expected outputs (or expected properties of outputs), and a way to run those evals automatically when your prompt or model changes. Without this, you’re flying blind when the model updates.

Your cost model holds at target volume. Run the math at 10x and 100x your current request volume. Include API costs, infrastructure, and human review time if applicable. If the numbers don’t work, you need to change the architecture — caching, batching, smaller models for simpler requests — before you scale.

You’ve tested latency under load. p50 latency in development is not p99 latency in production. Test with concurrent requests at realistic volumes. Add streaming if you haven’t; it makes latency feel faster even when total processing time is the same.

You’ve documented the failure modes. What happens when the model returns garbage? What happens when the API is down? What happens when the output fails your validation logic? Every production system needs explicit behavior for each of these — fallback, retry, alert, or graceful degradation.

You have a rollback plan for model updates. When your model provider updates the underlying model, your evals may catch a regression. You need a path to pin to an older model version, even temporarily, while you fix your prompts. Most providers support this for at least one version back.

Human review is scoped correctly. If your application involves consequential decisions, be explicit about where humans are in the loop, what they’re actually reviewing, and how long that realistically takes. A design that assumes humans review every output at high volume tends to degrade into humans rubber-stamping outputs — which is a different risk profile than you intended.

LLMs are genuinely useful in production. The teams that are making them work aren’t doing anything exotic — they’re applying the same operational discipline they’d apply to any complex external dependency, being honest about where the technology performs and where it doesn’t, and keeping humans appropriately in the loop for decisions that matter.