Hardware implications of LLM variants

Apr 12, 2026

Lifecycle of an LLM request

When you send a prompt to a model like Claude, the first thing that happens is prefill. The entire input sequence (your system prompt, conversation history, any tool results, basically everything you're sending to the model) gets processed in parallel through every layer of the transformer. This is the compute-heavy phase. The model reads all your tokens at once, runs them through attention and feedforward layers, and produces a set of key and value vectors for each token at each layer. Those vectors get stored in the KV cache, which serves as the model's working memory of your conversation.

Once prefill is done, the model switches to decoding, generating output one token at a time. Each new token needs to attend to every previous token, which means reading the entire KV cache on every step. This is where the character of the workload flips. Prefill is bound by how fast you can do matrix multiplications (compute-bound), but decoding is bound by how fast you can shuttle the KV cache in and out of memory (memory-bandwidth-bound). The GPU cores are mostly just waiting on data.
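The two phases can be sketched with a toy single-head attention and a KV cache. This is purely illustrative (random weights, made-up dimensions, no real model): the point is that prefill is one big batched matmul over all prompt tokens, while each decode step appends one token and re-reads the entire cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension (arbitrary, for illustration)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def prefill(x):
    """x: (seq_len, d) prompt embeddings -> build the KV cache in parallel.
    One large matmul over the whole prompt: this phase is compute-bound."""
    return x @ Wk, x @ Wv

def decode_step(x_new, k_cache, v_cache):
    """x_new: (d,) embedding of the latest token. Every step streams the
    whole cache through memory, which is why decoding is bandwidth-bound."""
    q = x_new @ Wq
    k_cache = np.vstack([k_cache, x_new @ Wk])  # append this token's K/V
    v_cache = np.vstack([v_cache, x_new @ Wv])
    scores = k_cache @ q / np.sqrt(d)           # attend to ALL cached tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache

prompt = rng.standard_normal((10, d))
k, v = prefill(prompt)                          # cache: (10, 64)
out, k, v = decode_step(rng.standard_normal(d), k, v)
print(k.shape)                                  # cache grew to (11, 64)
```

Note the asymmetry: prefill touches each weight matrix once for the whole prompt, while the decode loop touches the full (and growing) cache once per generated token.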

So when Anthropic ships a model with a 1M token context window, or thinking modes that can produce 128K output tokens, the question isn't just "is the model smarter?" It's "what does the hardware need to look like to actually serve this?"


KV cache

The biggest hardware implication of going from 200K to 1M tokens is the KV (key-value) cache. During inference, the model stores key and value vectors for every token in context so it doesn't have to recompute them. This cache grows with every token in the context window and with every concurrent request being served. Once models start supporting 1M token context windows, the KV cache can easily exceed the model weights in total memory usage.

To put rough numbers on it, a 1-million token context can easily consume 40 to 80 GB of VRAM depending on model size and quantization. For a large model like Opus, that's on top of the already enormous memory footprint of the weights themselves. Since the cache grows linearly with sequence length, a single 1M-token request uses 5x the KV cache memory of a 200K request. And Anthropic needs to serve many of these concurrently.
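The arithmetic behind those rough numbers is simple: per token, the cache stores one key and one value vector per layer per KV head. Anthropic doesn't publish Claude's architecture, so the model shape below (40 layers, grouped-query attention with 8 KV heads, 128-dim heads, an 8-bit quantized cache) is a hypothetical configuration chosen to land in the 40-80 GB range quoted above.

```python
def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # Factor of 2 is for keys AND values, each stored per layer,
    # per KV head, per token in context.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens

# Hypothetical large-model shape with an fp8-quantized cache.
total = kv_cache_bytes(tokens=1_000_000, n_layers=40, n_kv_heads=8,
                       head_dim=128, bytes_per_elem=1)
print(f"{total / 2**30:.0f} GiB")  # ~76 GiB for a single 1M-token request
```

Swap in more layers, more KV heads, or an fp16 cache and the total multiplies accordingly, which is why grouped-query attention and cache quantization matter so much at these context lengths.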


Thinking effort levels

The effort levels (low through max) control how many "thinking tokens" Claude generates internally before responding. Max provides the deepest reasoning with no constraint on token spending, so responses are slower and cost more.

Since Opus 4.6 supports up to 128K output tokens, a max-effort thinking request could generate a huge volume of output tokens, all of which need to be decoded autoregressively (one token at a time). Each new token requires reading the full KV cache. That's where memory bandwidth becomes the bottleneck, not raw compute.
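A back-of-the-envelope way to see the bandwidth bottleneck: at batch size 1, every decode step has to stream the model weights plus the full KV cache through the memory system at least once, so memory bandwidth puts a hard ceiling on tokens per second. The numbers below are illustrative assumptions, not a real deployment: roughly 3.3 TB/s of HBM bandwidth, 70 GB of quantized weights, and an 80 GB cache.

```python
def decode_tokens_per_sec(bandwidth_gb_s, weight_bytes, kv_cache_bytes):
    # Upper bound for single-request decoding: each token generated
    # requires reading the weights and the whole KV cache once.
    bytes_per_token = weight_bytes + kv_cache_bytes
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Illustrative: ~3.3 TB/s HBM, 70 GB weights, 80 GB KV cache.
print(round(decode_tokens_per_sec(3300, 70e9, 80e9), 1))  # ~22 tokens/sec
```

Batching amortizes the weight reads across requests, but each request's KV cache still has to be read per step, which is why long-context, long-output workloads push serving stacks toward more HBM bandwidth rather than more FLOPs.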
