TIL: The two phases in LLM inference
This is a TIL of a TIL. Simon Willison wrote about using an NVIDIA DGX Spark with an Apple Mac Studio for faster inference, with all the details here.

He talks about the two phases of inference:

- Prefill: influences Time-To-First-Token (TTFT)
- Decode: influences Tokens Per Second (TPS)

Prefill

- Read the prompt
- Build a Key-Value (KV) cache for each transformer layer in the model
- Bound by compute, as it initializes the model's internal state and does a lot of matrix multiplication.

...
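To make the two phases concrete, here is a rough toy sketch in plain Python. It is not how any real inference engine is written: the sizes, the random weights, and the names `forward_token`, `prefill`, and `decode` are all made up for illustration. It only shows the shape of the idea: prefill walks the whole prompt and fills a per-layer KV cache, and decode then adds one entry per new token while reusing everything already cached.

```python
import math
import random

# Toy sizes and random weights: assumptions for illustration only.
D, N_LAYERS, VOCAB = 8, 2, 16
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


EMBED = rand_matrix(VOCAB, D)
W_Q = [rand_matrix(D, D) for _ in range(N_LAYERS)]
W_K = [rand_matrix(D, D) for _ in range(N_LAYERS)]
W_V = [rand_matrix(D, D) for _ in range(N_LAYERS)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(m, v):
    return [dot(row, v) for row in m]


def attend(q, keys, values):
    """Softmax attention of one query over every cached key/value."""
    scores = [dot(q, k) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(D)]


def forward_token(token, cache):
    """Run one token through every layer, appending its K/V to the cache."""
    h = EMBED[token]
    for layer in range(N_LAYERS):
        keys, values = cache[layer]
        keys.append(matvec(W_K[layer], h))
        values.append(matvec(W_V[layer], h))
        a = attend(matvec(W_Q[layer], h), keys, values)
        h = [x + y for x, y in zip(h, a)]
    # Greedy pick: the vocabulary entry whose embedding best matches h.
    return max(range(VOCAB), key=lambda t: dot(h, EMBED[t]))


def prefill(prompt):
    """Phase 1: read the prompt and build one KV cache per layer.

    Real engines process the prompt tokens in parallel as large matrix
    multiplications, which is why this phase is compute-bound. The loop
    here is sequential only to keep the sketch short.
    """
    cache = [([], []) for _ in range(N_LAYERS)]
    first_token = None
    for token in prompt:
        first_token = forward_token(token, cache)
    return first_token, cache


def decode(first_token, cache, n_new):
    """Phase 2: generate one token at a time, reusing the cache."""
    out = [first_token]
    for _ in range(n_new - 1):
        out.append(forward_token(out[-1], cache))
    return out


prompt = [1, 2, 3, 4]
first, kv_cache = prefill(prompt)      # cost here shows up as TTFT
tokens = decode(first, kv_cache, 3)    # per-step cost here shows up as TPS
```

In this sketch the cache holds four entries per layer after prefill, and grows by exactly one entry per layer for each further decode step, which is the part that maps onto the TTFT versus TPS split described above.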