Benchmarks Grow Up: MLPerf Pivots From Tokens to Tasks

MLCommons just changed the scoreboard. MLPerf now measures tool use, long-context reasoning, and on-device multimodal tasks, shifting competition from raw throughput to completed work and joules per task. Hardware and procurement will pivot fast.

The day the stopwatch gave way to the checklist

For years, progress in artificial intelligence was reported in a single beat: how many tokens per second you can push through a model. It was a clean number and easy to compare, but it was not what users actually cared about. This week, MLCommons moved the goalposts to where the work happens. The new MLPerf Inference release adds the first evaluations for agentic tool use, long-context reasoning, and on-device multimodal workloads. The new headline metrics are not just throughput. They are tasks completed, time to completion, and joules per task.

In other words, the stopwatch is giving way to the checklist and the energy meter. If you run a contact center, a robot lab, a newsroom, or a hospital, this is the change you were waiting for. The benchmarks now look more like the jobs you need done.

What changed and why it matters

The new MLPerf categories do three big things:

  • They test tool-using agents. Models now have to choose when to call a calculator, hit a database, or search a corpus, then stitch the answers together into an action or a report.
  • They test reasoning across long context. Instead of a neat paragraph, models get a stack of documents, meeting transcripts, code, or sensor logs, and must find the signal in the noise.
  • They test multimodal work on the device. Phones, headsets, cars, and factory cameras need vision, audio, and text working together within a tight power and thermal budget.

If that sounds like the way people work, that is the point. A single number like tokens per second is like measuring a factory by how fast the conveyor belt moves. Useful, but not the same as counting finished products, the time to deliver them, or the electricity bill. MLPerf is now counting finished products.

The result is a new race. It rewards smart decision-making, efficient memory use, and careful coordination between hardware and software. The scoreboard now includes joules per task, not just tokens per second. Vendors will tune for end-to-end outcomes. Buyers will change what they ask for. Chip roadmaps will follow the new pressure.

Inside the new tests: concrete examples

Consider a tool-using agent benchmark. The prompt is a messy customer email about a double charge and a canceled order. The model must decide to check a transaction log, query a customer account, calculate a refund, and write a response that follows policy. A token-per-second metric could look great while the model confidently hallucinates a fix. The new benchmark measures whether the model called the right tools, produced a correct resolution, used the fewest steps, and did it within a time and energy budget.
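
To make the scoring concrete, here is a minimal sketch of how a harness might grade one agentic run. The tool names, the reference resolution, and the budgets are illustrative assumptions, not MLPerf's actual rubric.

```python
from dataclasses import dataclass

@dataclass
class AgentTrace:
    """One benchmark run: which tools the agent called and what it produced."""
    tool_calls: list[str]   # e.g. ["transaction_log", "refund_calculator"]
    resolution: str         # the agent's final written resolution
    seconds: float          # wall-clock time for the whole task
    joules: float           # measured energy for the whole task

def score_run(trace: AgentTrace, required_tools: set[str], reference_resolution: str,
              time_budget_s: float, energy_budget_j: float) -> dict:
    """Grade a single agentic task the way the new benchmarks frame it:
    right tools, right outcome, few steps, and within time and energy budgets."""
    return {
        "correct_tools": required_tools <= set(trace.tool_calls),
        "correct_outcome": reference_resolution in trace.resolution,  # crude reference match
        "steps": len(trace.tool_calls),                               # fewer is better
        "within_time": trace.seconds <= time_budget_s,
        "within_energy": trace.joules <= energy_budget_j,
    }

# Hypothetical run for the double-charge example above.
run = AgentTrace(
    tool_calls=["transaction_log", "customer_account", "refund_calculator"],
    resolution="Refund of $42.00 issued per policy.",
    seconds=8.4,
    joules=310.0,
)
print(score_run(run, {"transaction_log", "refund_calculator"},
                "Refund of $42.00", time_budget_s=15.0, energy_budget_j=500.0))
```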

For long context, picture a legal assistant asked to summarize risk across 200 pages of contracts, emails, and meeting minutes. The model has to navigate a long window, pull key obligations and exceptions, and produce a short brief with citations. The old metrics did not punish repeated scanning or wasteful attention. The new tests reward models and runtimes that bring the right text into focus, cache it smartly, and avoid reprocessing what they already know.
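
A small sketch of the "avoid reprocessing" idea, assuming a content-hash cache in front of a hypothetical summarize call. Real runtimes work at the level of key value caches and retrieval indexes, but the principle is the same: pay the model cost once per unique chunk.

```python
import hashlib

_chunk_cache: dict[str, str] = {}

def summarize_chunk(chunk: str, summarize) -> str:
    """Summarize one document chunk, reusing a cached result when the exact
    same text has been seen before. `summarize` is any callable mapping text
    to a summary; it stands in for the model call."""
    key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    if key not in _chunk_cache:
        _chunk_cache[key] = summarize(chunk)  # model is invoked only on a cache miss
    return _chunk_cache[key]

# Boilerplate clauses that repeat across 200 pages are processed a single time.
clause = "Either party may terminate this agreement with 30 days notice."
first = summarize_chunk(clause, lambda text: text[:40] + "...")
second = summarize_chunk(clause, lambda text: text[:40] + "...")
assert first == second
```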

On-device multimodal workloads look like this: a phone camera watches a home repair, listens to a user describe a problem, and overlays step-by-step guidance. The device has a few watts to spare and a small memory budget. The benchmark checks whether the system recognizes the parts accurately, follows the conversation, answers quickly, and keeps within a strict energy envelope.

These are real jobs. They force the stack to work as a system, not as a single model in a vacuum.

The new metrics: from speed to outcomes and energy

MLPerf did not throw away speed. Latency still matters, and throughput still matters when you operate at scale. What changed is the top line.

  • Task success rate: Did the system complete the job correctly, judged against a reference standard?
  • Time to first useful action and time to completion: How quickly did the system produce something you can use, and how long did it take to finish?
  • Joules per task: How much energy did the system consume to get the job done?
  • Tool efficiency: How many tool calls did it make, how much redundant work did it perform, and how much network or disk I/O did it trigger? (A minimal harness sketch for computing these follows the list.)
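
Here is that harness sketch. It assumes per-task records with illustrative field names; the point is that the aggregation is straightforward once the harness records outcomes, times, and energy.

```python
from statistics import mean

def summarize_metrics(runs: list[dict]) -> dict:
    """Aggregate the new headline numbers from per-task records.
    Each record is assumed to carry: success (bool), seconds (float),
    joules (float), and tool_calls (int)."""
    return {
        "task_success_rate": mean(1.0 if r["success"] else 0.0 for r in runs),
        "mean_time_to_completion_s": mean(r["seconds"] for r in runs),
        "joules_per_task": mean(r["joules"] for r in runs),
        "mean_tool_calls": mean(r["tool_calls"] for r in runs),
    }

runs = [
    {"success": True,  "seconds": 8.4,  "joules": 310.0, "tool_calls": 3},
    {"success": False, "seconds": 12.1, "joules": 455.0, "tool_calls": 6},
]
print(summarize_metrics(runs))
```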

When you optimize for these, you end up drawn toward different design choices. Suddenly speculative decoding is not just a party trick. Mixture of experts looks attractive for long sequences. Memory efficient runtimes are not optional. System level co-design stops being a research slogan and becomes a procurement requirement.

Why vendors will move, fast

Benchmarks shape markets. If you sell accelerators, servers, or cloud instances, you live and die by the slide that shows your position vs the field. When that slide starts to plot task completion and joules per task, your roadmap changes.

Here is what is likely to accelerate in the next two quarters:

  • Speculative decoding in the runtime and in hardware. Vendors will add ways to guess likely future tokens and verify them, turning a stop-and-go process into a smooth flow. Expect dedicated control paths, smarter branch prediction in decoders, and memory layouts that keep candidates close. A minimal draft-and-verify sketch follows this list.
  • Mixture of experts at scale. Instead of one huge model that does every task, a router sends each step to a small group of specialists. That reduces total compute for the same quality on long context and complex tool use. Hardware schedulers will learn to keep expert weights resident and to prefetch the right ones just in time.
  • Memory first design. Long context stresses memory bandwidth and capacity. Look for more stacks of high bandwidth memory, new compression for key value caches, and smart eviction that respects narrative flow. Runtime libraries will fuse attention, cut redundant passes, and reuse intermediate results across tool calls.
  • System level co-design. Companies will ship packages where model, runtime, interconnect, and even the tool suite are tuned together. Think of a phone where the vision model, speech model, and text model have shared buffers and a common planner. Or a server where the vector database, the retrieval model, and the generator are on the same fabric with predictable latency.
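
Here is the draft-and-verify sketch referenced in the list. The draft_model and target_model callables are stand-ins: real systems verify candidate tokens against the target model's probabilities in one batched forward pass instead of re-generating and comparing, which is where the speedup comes from.

```python
def speculative_decode(prompt: list[str], draft_model, target_model,
                       lookahead: int = 4, max_tokens: int = 64) -> list[str]:
    """Draft-and-verify loop: a cheap model proposes `lookahead` tokens, the
    expensive model checks them, and only the agreed prefix (plus one
    correction on divergence) is kept."""
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        draft = draft_model(out, lookahead)       # cheap guess of the next few tokens
        if not draft:
            break
        verified = target_model(out, len(draft))  # the big model's tokens for the same positions
        accepted = 0
        for d, v in zip(draft, verified):
            if d != v:
                break
            accepted += 1
        out.extend(draft if accepted == len(draft) else verified[:accepted + 1])
    return out

# Toy stand-ins over a fixed string, just to exercise the loop.
TARGET = list("count the finished products not the belt speed")
def target_model(ctx, n):   # pretend big model: always right
    return TARGET[len(ctx):len(ctx) + n]
def draft_model(ctx, n):    # cheap proposer: sometimes wrong
    return ["?" if c == "e" else c for c in TARGET[len(ctx):len(ctx) + n]]

print("".join(speculative_decode([], draft_model, target_model, max_tokens=len(TARGET))))
```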

This is not theoretical. Nvidia already invests in speculative decoding kernels and attention optimizations. Google leans into mixture of experts in production. Apple, Qualcomm, and MediaTek have been pushing on-device multimodal features where every milliwatt matters. Amazon and Microsoft tune complete stacks that include storage and network paths for retrieval. The new benchmarks will amplify these bets.

How procurement scorecards will change

If you buy systems, the scoreboard you carry to a vendor meeting just changed. Expect scorecards to look like this:

  • Define the job: a set of tasks that reflect your actual use. For a bank, that could be compliance summaries and customer email resolution. For a retailer, it could be product search, store associate guidance, and returns processing.
  • Measure outcomes: task success rate and the quality of outputs. Use hidden test cases and clear grading rubrics to avoid gaming.
  • Measure end-to-end time: from request to final action or answer. Count tool calls, queue time, and retries.
  • Measure energy and cost: joules per task and total cost per task. Include network and storage, not just compute.
  • Stress the system: longer contexts, multilingual cases, noisy audio, or low light video. See how systems degrade.

Be specific about weights. A contact center might value task success over raw speed. A trading desk might pay for the lowest possible latency. A wearable device team will put joules per task front and center.
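
One way to make those weights explicit is a published scoring formula. A minimal sketch with made-up weights and vendor numbers, purely to show the shape; each metric is assumed to be normalized so that 1.0 is best.

```python
def weighted_score(metrics: dict, weights: dict) -> float:
    """Combine normalized metrics (each scaled to 0..1, where 1 is best)
    into a single procurement score using explicit, published weights."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(weights[k] * metrics[k] for k in weights)

# A contact center that values task success over raw speed (illustrative numbers).
weights  = {"task_success": 0.5, "time_to_completion": 0.2, "joules_per_task": 0.3}
vendor_a = {"task_success": 0.92, "time_to_completion": 0.70, "joules_per_task": 0.55}
print(round(weighted_score(vendor_a, weights), 3))  # 0.765
```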

With the new MLPerf tests as a public baseline, your private scorecard can be a thin layer on top. You get comparability across vendors and still keep it tuned to your domain. Over the next two quarters, expect requests for proposals to mirror this structure.

The engineering playbook: what to build now

For builders and platform teams, the new benchmarks are a map of where to invest.

  • Plan the agent, not just the model. Agents need planners, tool registries, and memory. Build a simple decision layer that chooses which tool to call and when to stop. Keep a scratchpad for intermediate results. Cache what you fetch.
  • Treat context as a budget. Long context is a shared resource. Compress aggressively and keep track of what you have already processed. Use segment aware attention. Chunk documents by meaning, not by character count.
  • Make energy visible. Add counters for joules per task in your test harness. On servers, capture accelerator and network power. On devices, read system energy sensors. Make the number visible in continuous integration dashboards.
  • Use mixture of experts when tasks vary. A single model is a Swiss Army knife. Experts are scalpels. Route code to code experts and math to math experts. Keep the router simple and predictable; a minimal routing sketch follows this list.
  • Adopt speculative decoding with guardrails. Make predictions in parallel, verify quickly, and fall back when uncertainty spikes. Track when speculation helps and when it wastes work.
  • Align tools with the runtime. Tools are not just application programming interfaces. They have latency and resource footprints. Profile them. Put frequent ones close to the model, or even on the same host. Use batchable versions of tools where possible.
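
Here is the routing sketch referenced in the list. The expert names and keyword lists are assumptions; production routers are usually small learned classifiers, but the contract is the same: map each step to exactly one expert, deterministically.

```python
EXPERT_KEYWORDS = {
    "code":  ("def ", "class ", "import ", "traceback"),
    "math":  ("integral", "derivative", "solve", "equation"),
    "legal": ("contract", "clause", "liability", "indemnify"),
}

def route(task_text: str, default: str = "general") -> str:
    """Deterministic routing: the first expert whose keywords appear wins.
    Predictable behavior makes capacity planning and debugging easier."""
    lowered = task_text.lower()
    for expert, keywords in EXPERT_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return expert
    return default

print(route("Review this clause on liability caps"))  # legal
print(route("What is the derivative of x**2?"))       # math
```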

This is ordinary engineering discipline applied to a field that is finally ready to reward it.

The chip angle: what silicon will add

Chips speak in constraints. The new tests tighten the constraints that matter.

  • Control flow acceleration. Tool use is a lot of branching. Expect accelerators and compilers to add more support for dynamic control flow, so models can pause, call a tool, and resume without losing throughput.
  • Memory centric layouts. Long context wants wide, low latency access. That means more high bandwidth memory, clever on-chip cache hierarchies, and software that pins the right parts of the sequence in fast tiers.
  • Expert residency. Mixture of experts works best when the right experts are already loaded. Vendors will enable fast expert swaps, weight compression tailored to specific experts, and router hints at compile time.
  • Energy at the task level. Chips will expose counters that let runtimes close the loop on joules per task. Expect firmware and drivers to surface per-stream energy to the application.
  • Multimodal on device. Phone and headset chips will grow small matrix engines for vision and audio, tighter links between the image signal processor and the neural engine, and shared memory regions so models can pass features without going through slow paths.

If you look at the roadmaps from Nvidia, AMD, Intel, Google, Apple, and Qualcomm, you can already see these themes. The new MLPerf release gives them a common target to shoot at.

Trade-offs and how to avoid the traps

New benchmarks invite new ways to game them. There are real risks. Smart buyers and honest builders can navigate them.

  • Overfitting to the task set. If a vendor hand tunes for benchmark tasks, the system may look great on paper and brittle in the field. Solution: keep a private test set with hidden tasks and rotate them.
  • Tool caches that leak the answer. If tools are preloaded with benchmark data, you are not testing generalization. Solution: require cold start runs, log cache hits, and test with live or randomized backends.
  • High variance agents. Agents can be creative, which can also mean inconsistent. Solution: measure variance across runs, set bounds on retries, and reward predictable behavior.
  • Energy accounting that forgets the rest of the system. It is easy to count accelerator watts and ignore network or storage. Solution: include system power and per-request networking in joules per task.
  • Latency games. Systems can rush a weak first token to look fast. Solution: measure time to first useful action with a rubric and tie it to accuracy.

These are not new to benchmarking, but they matter more when the unit is a real job, not a token stream.

Policy and public sector implications

Public agencies are large buyers of compute and models. The new MLPerf tests give them a way to ask better questions. A city that deploys translation and summarization for services can demand joules per task limits and multilingual success rates. A health system can require evidence that long-context summaries maintain accuracy under noisy inputs. A school district that buys classroom assistants for tablets can set power budgets that preserve battery life through a school day.

Because the tests are public and repeatable, they can also reduce the risk of lock-in. If a vendor meets the task and energy targets with open disclosure of methods, agencies can compare alternatives without guessing. That will increase pressure to publish full system configurations and to adopt interoperable tool interfaces.

What to watch over the next two quarters

If you want to track the shift from tokens per second to tasks done, here are visible signs.

  • Vendor slides change. You will see joules per task and task success rate on the first page. When that happens, roadmaps have already been updated.
  • Cloud instance names and pricing change. Expect instances that bundle retrieval, vector databases, and inference in a single price with a task guarantee.
  • On-device demos get practical. Watch for phones doing real time translation that lasts a whole flight on a single charge, with consistent accuracy.
  • Compiler releases mention agents. Runtimes will talk about control flow, persistent memory across tool calls, and efficient routing for experts.
  • Requests for proposals ask for task level guarantees. Energy budgets and end-to-end latency will be stated as must-meet targets.

When these appear, the benchmark shift has rippled into real buying and building.

Zooming out: the cultural shift inside teams

There is also a softer change underway. Teams that used to say model, model, model are starting to say system. That brings in people who manage memory layouts, storage engines, and network scheduling. It rewards engineers who shave seconds off a database query rather than chase one more point on a static leaderboard.

Think of it as moving from a speed trial to a relay race. You still care about the sprinter. You also care about the baton pass, the handoff zones, and the timing between runners. Task level benchmarks force you to practice the whole relay.

A practical checklist to leave with

Whatever your role, you can take steps this week.

  • If you build models or agents: add energy and end-to-end metrics to your development cycle. Practice on one real task from your users. Report success rate, time to completion, and joules per task on every pull request.
  • If you run infrastructure: expose per-request energy and latency to your application teams. Pre-position common tools near the model. Measure cold start penalties.
  • If you buy systems: draft a one-page scorecard with your top three tasks, target success rates, target times, and energy budgets. Share it with vendors before you see demos.
  • If you design chips: publish a short note on how your next part accelerates control flow, long context, or multimodal on device. Give software teams an early driver with energy counters.

Progress happens faster when everyone pulls toward the same scoreboard.

Conclusion: from noisy speed to useful work

The new MLPerf release is not just a fresh set of numbers. It is a sign that artificial intelligence is leaving the lab treadmill and stepping into the workshop. The field has plenty of raw speed. The question now is whether systems can finish real jobs, do it predictably, and do it within the energy and cost limits of the world we live in.

That is a healthier race. It rewards careful engineering, wise allocation of compute, and designs that respect reality. If the last era was about making models bigger, the next era is about making systems sharper. Benchmarks that count tasks and joules point the way. The rest of the industry is about to follow.
