Your reasoning API is billing you for thoughts the model could skip

Every time your product calls a reasoning model, you pay for tokens. Not just the answer — the entire inner monologue. A model working through a coding problem might generate 5,000 to 30,000 "thinking tokens" before it writes a single character of output. 1 You don't see those tokens; you pay for them anyway.

A paper published June 4 by researchers at the University of Pennsylvania, UC San Diego, and Meta proposes a different architecture entirely: don't tokenize the thoughts at all. 2 Their system, NF-CoT (Normalizing Flow Chain-of-Thought), replaces the entire chain-of-thought token sequence with 64 compact continuous vectors — mathematical structures that carry the same reasoning content at a fraction of the compute cost.

The paper got essentially zero social attention the week it dropped. 3 That won't last. Here's what it means before the hype cycle catches up.

What text-based CoT actually costs you

Chain-of-thought reasoning forces the model to verbalize every step before moving to the next one. That sounds natural — it mirrors how humans explain their work. But the "text is the medium of thought" assumption is a historical accident of how these models were trained, not a fundamental requirement for good reasoning.

The consequence is severe verbosity. A typical reasoning sequence for a code generation problem encodes roughly 385 text tokens per reasoning trace in NF-CoT's benchmark setup. 4 Text is a lossy, sequential, commit-before-you-know-where-you're-going format: every word must be chosen before the thought is complete, and the model has to spell out things that don't need to be said at all.

The cost shows up in two places on your invoice. First, the sheer token volume. Researchers at Stanford, UC Berkeley, CMU, and Microsoft found that across 8 frontier reasoning models, thinking token consumption for the same query varies by up to 860% depending on which model you use — and that 21.8% of the time, the model with the lower list price ends up costing more because it generates far more thinking tokens. 5 Second, latency: those tokens must be generated sequentially before your user sees anything.

What NF-CoT does differently

NF-CoT replaces the text reasoning trace with 64 continuous latent "slots" — vectors in a high-dimensional space, generated by a normalizing flow (a mathematical function that maps between probability distributions while preserving structure). 2

NF-CoT compresses 385 verbalized reasoning tokens into 64 compact latent vectors — NF-CoT compresses a ~385-token reasoning trace into 64 latent slots, cutting intermediate compute without changing the model's output format. 2

Three properties make this viable for production:

KV-cache compatible: NF-CoT still runs left-to-right autoregressively, so it works with existing inference infrastructure. KV-cache is the reuse mechanism that lets servers handle many concurrent users without recomputing context from scratch. 4
Single-pass generation: earlier diffusion-based latent reasoning (LaDiR) needed 30 iterative denoising steps per reasoning block. NF-CoT completes in one pass. 2
RL-compatible: the latent space supports reinforcement learning fine-tuning (GRPO — optimizing model behavior via outcome-based feedback), so reasoning quality can keep improving post-deployment. 2

The numbers that matter

The benchmarks are on code generation (5 standard tasks: MBPP, MBPP+, HumanEval, HumanEval+, LiveCodeBench). Starting from Qwen3-8B as the base model:

Loading chart…

Average pass@1 across the five benchmarks: 68.8 for NF-CoT versus 55.8 for the base model — a +13.0 percentage point gain. 2 Adding RL fine-tuning (GRPO) in the latent space pushes this to 70.1 (+14.3). 2

Versus LaDiR, the previous best latent reasoning method: 2

2.48× fewer FLOPs per sample (19.9 vs. 49.3 TFLOPS)
1.92× faster total inference (325.6s vs. 625.3s)
6.66× less training compute

The paper also reports pass rates exceeding explicit text chain-of-thought, making this a strict improvement rather than a cost tradeoff — at least on code tasks. 2

The latent reasoning lineage

NF-CoT is the third generation of a research line:

Meta's COCONUT (December 2024) proved the concept by passing the model's internal hidden state directly as the next input, skipping token decoding. On logic tasks requiring backtracking, it beat standard CoT — but produced deterministic continuous thoughts with no probability model. 6

LaDiR (UCSD/Meta, October 2025) added structure via latent diffusion, but required 30 denoising steps per block and wasn't KV-cache compatible. 7

NF-CoT (June 2026) solves both: single-pass generation, exact probability estimation, standard inference compatibility.

The latent reasoning research arc from COCONUT to NF-CoT — Three generations of latent reasoning, each fixing the prior method's production blocker. NF-CoT is the first to check all three engineering constraints simultaneously. AI-generated illustration.

The open research gap: NF-CoT has only been tested on code generation. Math reasoning and multi-hop question answering remain untested, and the authors flag this explicitly. 2 No open-source code release and no API offering yet either.

What this means for your product

Today, the lever you have is reasoning_effort. OpenAI GPT-5.5 offers six tiers from none to xhigh. 8 Anthropic's adaptive thinking for Claude Opus 4.7/4.8 uses low through max. 9 The spread isn't trivial: on the same coding task, none versus xhigh can differ by 5.5× in cost. 10 A simple support classification feature should be on low; multi-step agentic code generation should be on high. If you haven't mapped your features to effort tiers, do that before your user count scales.

In 12 months, the question to ask your model vendors: when latent reasoning arrives in your API, how will you meter thinking vectors? This matters because thinking token costs are already opaque enough that the Stanford/Berkeley/CMU/Microsoft study found they cause pricing inversions 21.8% of the time. 5 Removing thinking tokens from cost comparison reduced those inversions by 70%. Latent reasoning could make the auditing problem worse or better depending on whether vendors price on FLOPs. That's not settled.

One caveat: losing a readable chain-of-thought creates interpretability problems. Researchers tracking AI safety have raised concerns that latent reasoning may be impossible to audit externally. 11 If your product operates in a regulated domain where you need to explain model reasoning to auditors, this is a compliance variable to track before adopting latent modes.

What to do this week

Set effort tiers before you scale: run a sample of your last 1,000 API calls, classify each by task type (simple lookup vs. multi-step reasoning vs. code generation), and assign each tier a default reasoning_effort. The goal is to stop paying high rates for tasks that don't need it.

Watch for NF-CoT on Hugging Face: the research code isn't public yet, but implementations typically appear within 1–3 months of publication. When it does, the question to ask your infrastructure team is whether your LLM serving stack can run a hybrid model where some positions use a normalizing flow head and others use a standard language model head — that's the architecture switch that enables latent reasoning without a full rewrite.

Ask your model vendor two questions: (1) Do you have plans for latent/compressed reasoning modes? (2) If so, how will thinking compute be metered? The answer will tell you whether to optimize for token-count or FLOP-count in your architecture decisions.

The paper itself is four days old and has two tweets. The underlying problem — that verbalized reasoning is an unnecessarily expensive format — has been building since COCONUT in 2024. NF-CoT is the first version that solves the production engineering constraints. The commercial question is who productizes it first.

Full paper: arXiv:2606.06447 — University of Pennsylvania, UC San Diego, Meta. Published June 4, 2026. 2

Cover image: AI-generated