This is a TIL of a TIL.

Simon Willison wrote about using an NVIDIA DGX Spark with an Apple Mac Studio for faster inference, with all the details here.

He talks about the two phases of inference:

  1. Prefill: Influences Time-To-First-Token (TTFT)
  2. Decode: Influences Tokens Per Second (TPS)
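As a rough illustration of how the two metrics combine into the latency a user sees, here is a minimal sketch. The function name and the numbers are hypothetical and are not taken from the original post; it assumes the first token arrives at TTFT and every later token streams at a constant TPS.

```python
def total_latency_s(ttft_s: float, tps: float, output_tokens: int) -> float:
    """Estimate end-to-end latency for a response of at least one token.

    Assumption: the first token arrives after ttft_s (prefill), and each
    remaining token takes 1 / tps seconds (decode).
    """
    return ttft_s + (output_tokens - 1) / tps


# Hypothetical numbers, for illustration only.
latency = total_latency_s(ttft_s=2.0, tps=50.0, output_tokens=101)
print(latency)  # 2.0 s of prefill + 100 tokens at 50 tokens/s
```

A long prompt with a short answer is dominated by the TTFT term, while a short prompt with a long answer is dominated by the TPS term.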

Prefill

  1. Read the prompt
  2. Build a Key-Value Cache for each transformer layer in the model

Prefill is bound by compute: it initializes the model’s internal state and does a lot of matrix multiplication over the whole prompt at once.
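The two prefill steps above can be sketched in plain Python. This is a toy under stated assumptions, not how any real inference engine is written: `prefill`, `matvec`, and `random_matrix` are hypothetical helpers, the sizes are tiny, and every layer is fed the same embeddings, whereas a real model feeds each layer the previous layer's output.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Stand-in for learned projection weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def prefill(prompt_embeddings, layers):
    """Process the whole prompt and return one KV cache entry per layer.

    Simplification: every layer sees the same embeddings.
    """
    kv_cache = []
    for w_key, w_value in layers:
        # One key and one value vector per prompt token, per layer.
        keys = [matvec(w_key, x) for x in prompt_embeddings]
        values = [matvec(w_value, x) for x in prompt_embeddings]
        kv_cache.append({"keys": keys, "values": values})
    return kv_cache


rng = random.Random(0)
d_model, n_layers, prompt_len = 4, 2, 3  # toy sizes, chosen for illustration
layers = [
    (random_matrix(d_model, d_model, rng), random_matrix(d_model, d_model, rng))
    for _ in range(n_layers)
]
prompt_embeddings = [
    [rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(prompt_len)
]

kv_cache = prefill(prompt_embeddings, layers)
print(len(kv_cache), len(kv_cache[0]["keys"]))  # prints "2 3": layers, prompt tokens
```

The work grows with both the prompt length and the number of layers, and all of it is matrix multiplication, which is why this phase leans on raw compute.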

Decode

  1. Generate the output one token at a time
  2. Read the Key-Value Cache for each new token, and append that token's keys and values to it

Decode is bound by memory bandwidth: every new token re-reads the KV cache built during prefill, along with the model weights.
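One way to see why prefill ends up compute-bound and decode memory-bound is to count the arithmetic done per byte of weights read. The sketch below is a back-of-the-envelope estimate under loud assumptions: a single square weight matrix, 2-byte (fp16) weights, and no accounting for activations or KV cache traffic. The function name and the sizes are hypothetical.

```python
def arithmetic_intensity(tokens_per_pass: int, d_model: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights read for one d_model x d_model matrix multiply.

    Assumption: weights are read from memory once per forward pass and
    shared by every token processed in that pass.
    """
    flops = 2 * tokens_per_pass * d_model * d_model  # one multiply and one add per weight per token
    weight_bytes = bytes_per_weight * d_model * d_model
    return flops / weight_bytes


# Prefill handles the whole prompt in one pass; decode handles one token per pass.
prefill_intensity = arithmetic_intensity(tokens_per_pass=1024, d_model=4096)
decode_intensity = arithmetic_intensity(tokens_per_pass=1, d_model=4096)
print(prefill_intensity, decode_intensity)  # prints "1024.0 1.0"
```

With roughly one FLOP per byte, a decode step spends most of its time waiting on memory, so a machine with more memory bandwidth helps there, while a machine with more compute helps prefill.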