
TIL: The two phases in LLM inference

November 2, 2025 · 1 min · 107 words

This is a TIL of a TIL.

Simon Willison wrote about using an NVIDIA DGX Spark together with an Apple Mac Studio for faster inference, with all the details here.

He talks about the two phases of inference:

Prefill: Influences Time-To-First-Token (TTFT)
Decode: Influences Tokens Per Second (TPS)
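
To make the two metrics concrete, here is a minimal sketch that derives both from timestamps. The function name `latency_metrics` is made up for illustration, and leaving the first token out of the TPS figure is an assumption: tools differ on how they define TPS.

```python
def latency_metrics(request_time, token_times):
    """Return (ttft_seconds, decode_tokens_per_second) for one request.

    request_time: when the prompt was sent.
    token_times: arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one generated token")

    # TTFT covers the prefill phase plus the first decode step.
    ttft = token_times[0] - request_time

    # TPS here counts only tokens produced after the first one,
    # so it reflects the decode phase rather than prefill.
    decode_span = token_times[-1] - token_times[0]
    decoded = len(token_times) - 1
    tps = decoded / decode_span if decode_span > 0 else float("nan")
    return ttft, tps


# Made-up timestamps: first token after 2 s, then one token every 0.25 s.
ttft, tps = latency_metrics(0.0, [2.0, 2.25, 2.5, 2.75, 3.0])
```

With these made-up timestamps, a long prompt would push `ttft` up without changing `tps`, which is why the two phases are measured separately.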

Prefill

Read the prompt
Build a Key-Value Cache for each transformer layer in the model

Bound by compute: it processes all the prompt tokens in parallel, which means a lot of large matrix multiplications to initialize the model’s internal state.
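
A toy sketch of what "build a Key-Value Cache for each transformer layer" could look like. It is not how any real inference engine is written: the `w_k` and `w_v` matrices and the `prefill` function are hypothetical, and the toy feeds the same embeddings to every layer, whereas a real model passes each layer's output to the next.

```python
def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def prefill(prompt_embeddings, layers):
    """Build one KV cache entry per layer, covering every prompt position.

    prompt_embeddings: list of vectors, one per prompt token.
    layers: list of dicts holding hypothetical "w_k" / "w_v" projections.
    """
    kv_cache = []
    for layer in layers:
        # Simplification: every layer sees the raw embeddings.
        keys = [matvec(layer["w_k"], x) for x in prompt_embeddings]
        values = [matvec(layer["w_v"], x) for x in prompt_embeddings]
        kv_cache.append({"keys": keys, "values": values})
    return kv_cache


# Two toy layers, three prompt tokens, embedding size 2.
layers = [
    {"w_k": [[1, 0], [0, 1]], "w_v": [[2, 0], [0, 2]]},
    {"w_k": [[0, 1], [1, 0]], "w_v": [[1, 1], [1, -1]]},
]
prompt = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cache = prefill(prompt, layers)
```

The work grows with both the number of prompt tokens and the number of layers, and every prompt position can be projected at once, which is the kind of workload that keeps the compute units busy.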

Decode

Generate the output one token at a time
For each new token, read the Key-Value Cache built during prefill and append that token’s keys and values to it
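
To make decode concrete, here is a minimal sketch of a single decode step for one layer with single-head attention. The `decode_step` function and the cache layout (a dict with `keys` and `values` lists) are assumptions for illustration, not any library's API.

```python
import math


def decode_step(query, new_key, new_value, layer_cache):
    """Attend one new token over a layer's cached keys and values.

    Appends the new token's key and value to the cache, then returns the
    attention output for that single position.
    """
    layer_cache["keys"].append(new_key)
    layer_cache["values"].append(new_value)

    # Scaled dot-product score against every cached position.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key))
              for key in layer_cache["keys"]]

    # Numerically stable softmax over the scores.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]

    # Weighted sum of the cached values.
    dim = len(new_value)
    return [sum(w * v[i] for w, v in zip(weights, layer_cache["values"]))
            for i in range(dim)]


# One cached prompt position, then one newly generated token.
layer_cache = {"keys": [[1.0, 0.0]], "values": [[1.0, 0.0]]}
out = decode_step([0.0, 0.0], [0.0, 1.0], [0.0, 1.0], layer_cache)
```

Each step only computes projections for one token but has to read every cached key and value, so the cache grows by one entry per generated token.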

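Bound by memory bandwidth as it leverages the KV cache from the prefill: each new token requires reading the model weights and the cache from memory while doing comparatively little computation.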
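
A back-of-the-envelope way to see why the two phases hit different limits is a roofline-style estimate. The sketch below assumes roughly 2 FLOPs per parameter per token and that the weights are read from memory once per forward pass; it ignores attention and KV cache traffic. The hardware numbers are invented for illustration and do not describe the DGX Spark or the Mac Studio.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param):
    """Approximate FLOPs per byte of weights read for one forward pass.

    Assumes ~2 FLOPs per parameter per token and one read of the weights
    per pass; attention and KV-cache traffic are ignored.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound_by(tokens_per_pass, bytes_per_param, peak_flops, bandwidth_bytes):
    """Classify a pass as compute- or memory-bound with a roofline test."""
    # FLOPs per byte at which compute and memory take equal time.
    ridge = peak_flops / bandwidth_bytes
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute" if intensity >= ridge else "memory bandwidth"


# Hypothetical accelerator: 100 TFLOP/s, 500 GB/s (illustrative only).
PEAK, BW = 100e12, 500e9
prefill_bound = bound_by(2048, 2, PEAK, BW)  # whole prompt in one pass
decode_bound = bound_by(1, 2, PEAK, BW)      # one new token per pass
```

Under these assumptions the prefill pass comes out as bound by compute and the single-token decode pass as bound by memory bandwidth, matching the split described above.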