OpenTelemetry makes AI legible: a new spec arrives
A quiet but important release: OpenTelemetry’s new GenAI semantics standardize traces for prompts, tools, tokens, and safety. Here is why it matters, how to wire it up now, and what to expect as SDKs and platforms adopt it.

The quiet standard that makes AI legible
For a year, production AI has felt like flying with a black box and no cockpit. We saw outputs, bills, and Git commits. We did not see what actually happened step by step across prompts, tools, safety checks, and token flows. This week, the OpenTelemetry project shipped a generative AI observability spec that fills in the cockpit instruments. The new semantic conventions standardize how to trace large language model calls, agent tool use, token accounting, and safety events.
Standards might sound sleepy. This one is not. It gives teams a vendor-neutral way to see and control AI behavior in real time. You can wire it into existing tracing backends, build apples-to-apples comparisons between models, run evaluations on live traffic, and set regression alerts that actually map to user experience and spend.
Think of it as a common grammar for AI traces. If your model call is one verb, your tool calls are another, and your safety checks are adjectives that qualify the action, then the spec tells everyone which words to use and where to put them in the sentence. With shared words, your traces are portable across providers like OpenAI, Anthropic, Google, Cohere, and Mistral, and across inference stacks like vLLM or Text Generation Inference. They are also portable across observability vendors like Grafana, Honeycomb, New Relic, Datadog, Tempo, Jaeger, and Elastic.
What actually landed
OpenTelemetry’s generative AI semantics define a set of trace spans, attributes, and events that describe:
- Model inference calls: model name, provider, parameters such as temperature and max tokens, prompt and response metadata, input and output token counts, latency, and cache hits.
- Tool and function calls: which tool was invoked, arguments passed, results returned, timing, and errors.
- Safety and guardrail checks: categories evaluated, scores or labels, whether an action was blocked, and the policy that decided it.
- Usage and cost: token counts by role, tokenization latency, and optional cost estimates.
The spec keeps sensitive content optional and encourages redaction or hashing of full prompts and outputs. You get structure and comparability without forcing you to leak user data.
Under the hood, it looks like this:
- Each LLM request emits a span with standard attributes. You can nest spans for retries, tool calls, and safety checks under the parent request span.
- Token counts, safety outcomes, and intermediate tool steps are recorded as span attributes or span events. This lets you build metrics directly off traces and slice by anything in the attribute set.
- Cross-step context travels with the trace. That means you can carry budgets, policy versions, or A/B assignments across tool boundaries and still reconstruct the full story later.
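For example, a minimal sketch of that nesting in Python, using the OpenTelemetry API that ships today. The attribute and event names below follow this article's examples and are illustrative, not a verbatim copy of the spec.
from opentelemetry import trace

tracer = trace.get_tracer("ai-app")

with tracer.start_as_current_span("genai.inference") as request_span:
    request_span.set_attribute("gen_ai.model", "gpt-4o-mini")
    # Retries, tool calls, and safety checks become child spans of the request.
    with tracer.start_as_current_span("genai.tool.search") as tool_span:
        tool_span.set_attribute("tool.name", "document_search")
    # Intermediate outcomes become span events you can slice on later.
    request_span.add_event("safety.check", {"category": "PII", "decision": "allow"})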
This is not a greenfield reinvention. It folds into the same OpenTelemetry pipelines that already collect HTTP traces, database spans, and service metrics in most modern stacks.
Why this matters right now
- Cost control grows teeth. Token usage is first class, so you can track spend per user, per feature, and per model with the same rigor you apply to CPU or queries per second. No more weekly spreadsheet reconciliation.
- Evaluation meets production. You can sample real traces to run offline or inline quality checks, then tie scores back to the exact model, prompt, and context that produced them.
- Fair model comparisons. Standard attributes allow apples-to-apples comparisons across providers. You can run A/B tests and view latency, output length, and safety outcomes side by side without writing a new parser for each vendor.
- Policy moves from slide deck to code path. Safety checks and policy outcomes become events in the trace. That means you can monitor, alert, and even gate actions based on the same telemetry that drives reliability.
Wire it up in a day
You do not need to rebuild your stack. Treat this like instrumenting a new HTTP endpoint. Here is a plain path to get signal flowing.
- Pick your backend
- If you already run an OpenTelemetry collector, add an OTLP exporter to your tracing backend of choice. Grafana Tempo, Jaeger, Honeycomb, New Relic, and Datadog all ingest OpenTelemetry traces. Use what your team already knows.
- If you do not have a collector, start with the OpenTelemetry SDK direct-to-backend exporters, then add a collector later for batching, filtering, and data safety policies.
- Turn on the SDK
- Python and JavaScript are the quickest on-ramps. Install the OpenTelemetry SDK, the OTLP exporter, and any available auto-instrumentation packages.
- Set a service name for your AI app. Turn on sampling. For production, start with head-based sampling at 10 to 20 percent; bump to 100 percent for safety-critical flows.
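A minimal Python setup, assuming a collector listening on the default OTLP gRPC port. The endpoint and the 10 percent ratio are placeholders; swap in your own collector address and sampling policy.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Name the service and sample 10 percent of traces at the head.
provider = TracerProvider(
    resource=Resource.create({"service.name": "ai-app"}),
    sampler=ParentBased(TraceIdRatioBased(0.1)),
)
# Batch spans and ship them to a local collector over OTLP.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)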
- Instrument model calls
Wrap every LLM invocation in a span. Attach standard attributes from the spec. Record token usage and parameters. Here is the shape, first in Python, then in JavaScript.
# Python
import hashlib

from opentelemetry import trace

tracer = trace.get_tracer("ai-app")

with tracer.start_as_current_span("genai.inference") as span:
    span.set_attribute("gen_ai.provider", "openai")
    span.set_attribute("gen_ai.model", "gpt-4o-mini")
    span.set_attribute("gen_ai.temperature", 0.2)
    # Store a hash of the prompt, never the raw text.
    span.set_attribute("gen_ai.prompt.hash", hashlib.sha256(prompt.encode()).hexdigest())
    # call the model
    result = client.chat.completions.create(...)
    span.set_attribute("gen_ai.tokens.input", result.usage.prompt_tokens)
    span.set_attribute("gen_ai.tokens.output", result.usage.completion_tokens)
    # Cache-hit detection depends on your serving stack; this comparison is illustrative.
    span.set_attribute("gen_ai.cache.hit", getattr(result, "system_fingerprint", None) == cache_key)
// JavaScript
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("ai-app");

const span = tracer.startSpan("genai.inference");
span.setAttribute("gen_ai.provider", "anthropic");
span.setAttribute("gen_ai.model", "claude-3-5-sonnet");
span.setAttribute("gen_ai.temperature", 0.4);
// call the model
const resp = await client.messages.create(...);
span.setAttribute("gen_ai.tokens.input", resp.usage.input_tokens);
span.setAttribute("gen_ai.tokens.output", resp.usage.output_tokens);
span.end();
You do not need to attach raw prompts. If you must, use sampling plus redaction. Many teams store only a cryptographic hash and a handful of structured features such as instruction template id, few-shot count, and detected entities.
- Instrument tools
Agent tools are where bugs hide and costs spike. Give each tool call its own span and link it to the parent inference span. Include arguments, latency, and success or failure.
- For HTTP tools: use OpenTelemetry’s existing HTTP client instrumentation. Add attributes for tool name and purpose.
- For database or retrieval tools: reuse the database or search semantic conventions. Add an attribute to mark tool type and the number of documents retrieved.
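A sketch of a retrieval tool span in Python. search_documents is a hypothetical client, and the tool.* and retrieval.* attributes are illustrative rather than spec-mandated.
from opentelemetry import trace

tracer = trace.get_tracer("ai-app")

def run_retrieval_tool(query: str) -> list:
    # Started inside the inference span, this becomes a child span automatically.
    with tracer.start_as_current_span("genai.tool.retrieval") as span:
        span.set_attribute("tool.name", "document_search")
        span.set_attribute("tool.purpose", "ground the answer in internal docs")
        try:
            docs = search_documents(query)  # hypothetical retrieval client
            span.set_attribute("retrieval.documents.count", len(docs))
            return docs
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            raise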
- Record safety checks
Whether you use platform guardrails like Azure AI Content Safety, provider built-ins, or libraries like NeMo Guardrails, record the decision.
- Add a child span for moderation with attributes for policy name, categories, scores, and final action.
- If an action is blocked, attach a span event like safety.blocked with the reason. If you redact content, set an attribute such as gen_ai.safety.redacted = true.
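A sketch of recording a moderation decision in Python. moderation_client and the 0.8 threshold are stand-ins for whatever guardrail you actually run.
from opentelemetry import trace

tracer = trace.get_tracer("ai-app")

def check_output(text: str) -> bool:
    with tracer.start_as_current_span("genai.safety.moderation") as span:
        span.set_attribute("policy.name", "customer_data_handling")
        result = moderation_client.classify(text)  # hypothetical guardrail call
        span.set_attribute("gen_ai.safety.category", result.category)
        span.set_attribute("gen_ai.safety.score", result.score)
        blocked = result.score >= 0.8
        span.set_attribute("gen_ai.safety.decision", "block" if blocked else "allow")
        if blocked:
            span.add_event("safety.blocked", {"reason": result.category})
        return not blocked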
- Make cost a first-class metric
You already have tokens. Convert to spend. Load price tables from configuration, not code. Write the cost to the span and also to a metric that can be aggregated daily.
- cost_usd = (input_tokens * price_in_per_1k + output_tokens * price_out_per_1k) / 1000
- Attach cost_usd to the inference span and export a metric cost_usd with labels model, provider, feature, tenant.
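A sketch in Python. PRICE_TABLE is an assumed configuration mapping of per-1k-token prices, not part of the spec.
from opentelemetry import metrics

meter = metrics.get_meter("ai-app")
cost_counter = meter.create_counter("cost_usd", unit="USD", description="LLM spend")

def record_cost(span, input_tokens, output_tokens, model, provider, feature, tenant):
    # Prices per 1k tokens come from configuration, not code.
    price_in, price_out = PRICE_TABLE[(provider, model)]  # hypothetical config lookup
    cost_usd = (input_tokens * price_in + output_tokens * price_out) / 1000
    span.set_attribute("gen_ai.cost.usd", cost_usd)
    cost_counter.add(cost_usd, {
        "model": model, "provider": provider,
        "feature": feature, "tenant": tenant,
    })
    return cost_usd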
- Treat privacy as a product requirement
- Redact or hash sensitive fields before export. The collector can drop attributes or apply regex redaction.
- Use sampling rules. For example, retain 100 percent of blocked safety events, 25 percent of successful calls, and 0 percent of content bodies.
- Use trace context to carry a privacy_budget attribute. Start with a simple token or dollar cap per request. If the budget drops below zero, add a safety.blocked event and return a graceful error.
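One way to sketch the budget idea with OpenTelemetry baggage, which rides along with trace context across tool and service boundaries. The privacy_budget_usd key and the 0.25 dollar cap are assumptions, not spec attributes; a production version would also manage context tokens more carefully.
from opentelemetry import baggage, context, trace

tracer = trace.get_tracer("ai-app")

def start_request():
    # Attach a per-request cap to the trace context; 0.25 USD is a placeholder.
    return context.attach(baggage.set_baggage("privacy_budget_usd", "0.25"))

def charge_budget(span, cost_usd: float) -> bool:
    remaining = float(baggage.get_baggage("privacy_budget_usd") or 0.0) - cost_usd
    context.attach(baggage.set_baggage("privacy_budget_usd", str(remaining)))
    if remaining < 0:
        span.add_event("safety.blocked", {"reason": "budget_exhausted"})
        return False  # caller should return a graceful error
    return True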
This is not theory. Teams already do similar work with bespoke schemas. The spec makes those conventions portable and comparable.
Eval on prod without the drama
Offline evaluations on synthetic data are helpful. They are not enough. The point of a trace is to capture reality at the moment it mattered. Here is how to turn traces into evaluation pipelines.
- Sampling strategy. Decide what to evaluate. For example, sample 1 percent of traces where output length exceeds 500 tokens, or 5 percent of traces with a safety warning, or 10 percent of traces for a new prompt template.
- Materialization. Export sampled traces to a queue or a lake with enough context to reproduce the call: model, parameters, prompt hash, and tool chain. Do not export raw content unless you need it and have consent.
- Scoring. Run automated checks like groundedness (does the answer cite retrieved facts), refusal correctness, PII leakage, and jailbreak resilience. Libraries like Ragas, TruLens, and homegrown heuristics work. Log the scores back as span events on the original trace id, or as metrics keyed by trace id.
- Feedback loop. Wire alerts for significant score drops. Store baselines by model and prompt version. When scores regress, pivot to the exact spans to see what changed.
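A rough sketch of the scoring step in Python. score_groundedness stands in for whichever checker you use (Ragas, TruLens, or a homegrown heuristic), and the metric name is illustrative.
from opentelemetry import metrics

meter = metrics.get_meter("ai-eval")
groundedness = meter.create_histogram("eval.groundedness", description="0 to 1 score")

def score_sampled_trace(record):
    # record is a materialized trace: trace_id, model, prompt hash, answer, documents.
    score = score_groundedness(record["answer"], record["documents"])  # hypothetical checker
    groundedness.record(score, {
        "trace_id": record["trace_id"],
        "model": record["model"],
        "prompt_template": record["prompt_template_id"],
    })
    return score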
Eval-on-prod should feel like load testing for quality. The overhead stays low because traces already carry the structure you need.
Regression alerts that map to user experience and spend
Traditional alerts watch latency and errors. For AI, add:
- Output length percentiles by feature. Sudden shrinkage can be as bad as a 500 error.
- Safety block rates by country or tenant. A spike can signal prompt drift or a new attack vector.
- Tool call fan-out. If an agent goes from one tool call to five on average, your margin evaporates.
- Cost per successful answer. Track cost_usd divided by successful outcomes. Alert on step changes.
Implement these with trace-derived metrics. Many observability backends can turn span attributes into metrics. Use latency histograms, token histograms, and cost counters. Tie alerts to change windows whenever you roll a new model, a new template, or a new routing rule.
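If your backend cannot derive these metrics from span attributes directly, you can emit them from the application. A minimal sketch, with illustrative metric names:
from opentelemetry import metrics

meter = metrics.get_meter("ai-app")
output_tokens = meter.create_histogram("gen_ai.tokens.output", unit="tokens")
successes = meter.create_counter("gen_ai.requests.success")

def record_outcome(feature: str, model: str, tokens_out: int, succeeded: bool):
    output_tokens.record(tokens_out, {"feature": feature, "model": model})
    if succeeded:
        successes.add(1, {"feature": feature, "model": model})
    # Alert on cost_usd divided by gen_ai.requests.success, and on this histogram's p95.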
Apples to apples model comparisons
Once every call carries the same attributes, you can run simple and fair experiments.
- A/B or A/B/C by model name with matched parameters for temperature and max tokens. Use the same prompt template id across arms.
- Slice results by task type or user segment. Model A might be cheaper for short answers but worse for long-chain answers with retrieval.
- Compare safety outcomes per thousand requests. Some models may cost less up front but trigger more blocks or evasions, which increases operator overhead.
Build a standard comparison board in your observability tool:
- Latency p50, p95 by model
- Input and output token distributions by model
- Cost per successful answer by model
- Safety block rate and category distribution by model
With traces standardized, switching providers becomes a configuration change rather than an observability overhaul.
Policy enforcement, now wired into traces
Policies are useful only when they can stop bad actions and explain why. The spec makes policy a first-class citizen of the trace.
- Put policy ids and versions on relevant spans. Example: policy.version = 7; policy.name = customer_data_handling.
- Emit events when checks run and when actions are gated or allowed. Example: event safety.check with attributes category=PII, score=0.82, decision=block.
- Carry budgets and risk scores in trace context. Downstream tools can read them and enforce limits without bespoke plumbing.
Two concrete patterns:
- Action gating. Before a tool that emails customers runs, read the current safety and privacy context from the trace. If the content contains unverified personal data, set a span attribute action.gated = true and route the request through a human-in-the-loop review instead of sending (see the sketch after this list).
- Privacy budgets. Start with a unit like pii_points. Detect entities such as email, phone, and names. Deduct points whenever content with those entities moves to an external API. When the budget is exhausted, block further externalization and log a safety.blocked event with budget_exhausted = true.
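A sketch of the action-gating pattern in Python. contains_unverified_pii, route_to_human_review, and email_client are hypothetical stand-ins for your own policy check, review queue, and sender.
from opentelemetry import trace

tracer = trace.get_tracer("ai-app")

def send_customer_email(draft: str):
    with tracer.start_as_current_span("genai.tool.email") as span:
        span.set_attribute("tool.name", "email_customer")
        if contains_unverified_pii(draft):  # hypothetical policy check
            span.set_attribute("action.gated", True)
            span.add_event("safety.check", {
                "category": "PII", "decision": "block",
                "policy.name": "customer_data_handling",
            })
            return route_to_human_review(draft)  # hypothetical review queue
        return email_client.send(draft)  # hypothetical sender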
These patterns turn policy into code you can observe and improve. You will know which flows get blocked, why, and what it costs to stay compliant.
What to watch next as SDKs and platforms adopt
The spec is the start. Adoption will determine its impact. Here is what to keep an eye on.
- Native SDK support. Expect official OpenTelemetry instrumentation for common AI clients to land and mature in Python and JavaScript first. Watch for drop-in packages that wrap OpenAI, Anthropic, Vertex AI, Bedrock, and local inference servers like vLLM.
- Framework hooks. LangChain, LlamaIndex, and Haystack already emit traces of a sort. The next step is native emission of the new attributes, so your traces normalize without glue code.
- Platform-backed safety events. Cloud providers and model vendors will begin tagging safety category outcomes in a standard way. That makes policy dashboards work across vendors.
- Red-team events. Expect conventions for recording attack attempts such as prompt injection, canary string trips, and jailbreak probes. When these are first class, you can build heatmaps of adversarial pressure.
- Privacy budgets. Early patterns will solidify into recommended attributes and sampling rules. The collector will likely gain built-in processors for PII redaction and budget enforcement.
- Action gating baked in. Tool runners and orchestration frameworks will start reading trace context by default, then short-circuiting actions when policies say no. That moves us from observability to enforcement with audit trails.
Vendors will compete on visualization, anomaly detection, and automated tuning. The standard reduces switching cost and increases pressure to build features rather than lock-in.
A practical checklist for teams
If you run AI in production today, you can act this week.
- Decide your target backend and turn on OTLP ingestion. If you have none, pick Grafana Tempo or Jaeger to start.
- Instrument model calls with the new semantics. Capture model, provider, temperature, token counts, and a prompt hash.
- Wrap every tool call in a child span. Tag tool name, purpose, and latency.
- Add a safety span or events whenever moderation runs. Record category, score, and decision.
- Compute and record cost_usd per call. Roll up by feature and tenant.
- Set three alerts: cost per success, safety block rate, and output length p95. Tune thresholds by feature.
- Add a privacy budget attribute to trace context. Enforce a simple cap and log blocks.
- Sample 1 percent of traces for offline eval. Score groundedness and refusal correctness. Record results back to the trace id.
This gives you observability, control, and a path to continuous improvement without locking into a proprietary schema.
The shape of the next year
With standardized traces, the AI stack becomes more modular. You can swap models without losing visibility. You can route requests to cheaper models when the task does not need depth, and catch when a route degrades quality. You can enforce safety policies with the same rigor used for security policies in web services.
Most importantly, your team can talk clearly about what the system is doing. Product can ask why a feature got slower and see that tool fan-out grew. Finance can ask why a bill spiked and see that output lengths drifted. Compliance can ask whether personal data left the boundary and get a searchable record.
We are leaving the era of black box AI operations. The new spec gives us a shared language to make AI legible, controllable, and comparable. It is a boring superpower. Boring turns out to be what production needs.
Closing thought
Every important system eventually gains a dashboard and a discipline. Web got APM. Data got lineage. Now AI gets traces that explain themselves. Standards are not glamorous, but they compound. If you adopt the new generative AI semantics now, your team will spend less time patching holes and more time designing better agents. The cockpit is finally lighting up. It is time to fly like you mean it.