This is a TIL of a TIL.
Simon Willison wrote about using an NVIDIA DGX Spark with an Apple Mac Studio for faster inference, with all the details here.
He talks about the two phases of inference:
- Prefill: influences Time-To-First-Token (TTFT)
- Decode: influences Tokens Per Second (TPS)
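To make the two metrics concrete, here is a rough back-of-envelope sketch. It assumes prefill costs about 2 FLOPs per model parameter per prompt token and that each decode step reads every weight from memory once; both are common simplifications that ignore attention cost and KV cache traffic. The function names and the hardware numbers are hypothetical placeholders, not the specs of either machine.

```python
def estimate_ttft_seconds(params, prompt_tokens, flops_per_second):
    """Rough TTFT: prefill work divided by available compute.

    Assumes ~2 FLOPs (multiply + add) per parameter per prompt token.
    """
    return 2 * params * prompt_tokens / flops_per_second


def estimate_decode_tps(params, bytes_per_param, bandwidth_bytes_per_second):
    """Rough TPS: memory bandwidth divided by bytes read per generated token.

    Assumes every weight is read once per decode step.
    """
    return bandwidth_bytes_per_second / (params * bytes_per_param)


# Hypothetical model and hardware numbers, for illustration only.
params = 8e9  # 8B-parameter model
ttft = estimate_ttft_seconds(params, prompt_tokens=4096, flops_per_second=100e12)
tps = estimate_decode_tps(params, bytes_per_param=1, bandwidth_bytes_per_second=400e9)
print(f"TTFT ~ {ttft:.2f} s, decode ~ {tps:.0f} tokens/s")
```

Under these assumptions, more compute shortens TTFT without changing TPS, and more memory bandwidth raises TPS without changing TTFT.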
Prefill
- Read the prompt
- Build a Key-Value Cache for each transformer layer in the model
Prefill is bound by compute, as it initializes the model’s internal state and performs many matrix multiplications.
Decode
- Generate output tokens one at a time
- Read the Key-Value Cache built during prefill at every step
Decode is bound by memory bandwidth, as it leverages the KV cache built during prefill.
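The two phases can be sketched with a toy single-head attention model in plain Python. This is a minimal illustration under loud assumptions: the weights are random, there is no tokenizer, MLP, or sampling, and every helper name (`make_layers`, `forward_token`, `prefill`, `decode`) is made up for this example. It only shows where the per-layer KV cache is written and where it is read.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Softmax attention of one query over all cached keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]


def make_layers(n_layers, dim, seed=0):
    """Random projection matrices standing in for trained weights."""
    rng = random.Random(seed)

    def rand():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    return [{"W_q": rand(), "W_k": rand(), "W_v": rand()} for _ in range(n_layers)]


def forward_token(x, layers, cache):
    """Run one token through every layer, appending its key/value to each cache."""
    h = x
    for layer, kv in zip(layers, cache):
        kv["keys"].append(matvec(layer["W_k"], h))
        kv["values"].append(matvec(layer["W_v"], h))
        attn = attend(matvec(layer["W_q"], h), kv["keys"], kv["values"])
        h = [hi + ai for hi, ai in zip(h, attn)]  # residual connection
    return h


def prefill(prompt, layers):
    """Read the whole prompt and build one KV cache per layer."""
    cache = [{"keys": [], "values": []} for _ in layers]
    h = None
    # A real engine processes all prompt tokens in one batched pass,
    # which is why this phase is dominated by matrix multiplication.
    for x in prompt:
        h = forward_token(x, layers, cache)
    return cache, h


def decode(last_hidden, layers, cache, n_tokens):
    """Generate n_tokens steps; each step re-reads the full cache in every layer."""
    outputs = []
    h = last_hidden
    for _ in range(n_tokens):
        # Toy shortcut: feed the hidden state back in instead of
        # sampling a token and looking up its embedding.
        h = forward_token(h, layers, cache)
        outputs.append(h)
    return outputs


layers = make_layers(n_layers=2, dim=4)
rng = random.Random(1)
prompt = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]

cache, h = prefill(prompt, layers)
generated = decode(h, layers, cache, n_tokens=3)
print(len(cache), len(cache[0]["keys"]), len(generated))
```

Each decode step does only a small amount of arithmetic for one token but touches every cached key and value in every layer, which is the intuition behind decode being limited by how fast memory can be read.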