Performance | 🦉 olshansky 🦁

TIL: The two phases in LLM inference

This is a TIL of a TIL. Simon Willison wrote about using an NVIDIA DGX Spark with an Apple Mac Studio for faster inference, with all the details here. He talks about the two phases of inference:

Prefill: influences Time-To-First-Token (TTFT)
Decode: influences Tokens Per Second (TPS)

Prefill
Read the prompt
Build a Key-Value Cache for each transformer layer in the model
Bound by compute, as it initializes the model’s internal state and does a lot of matrix multiplication. ...
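To make the link between the two phases and the two metrics concrete, here is a rough back-of-the-envelope sketch. It assumes the first token arrives after the prefill time (TTFT) and every later token arrives at a steady decode rate (TPS); real systems vary with batching and prompt length. The function name `estimate_latency` and the sample numbers are hypothetical, chosen only for illustration.

```python
def estimate_latency(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Rough end-to-end latency in seconds for one response.

    Assumption: the first token shows up after `ttft_s` (prefill),
    and the remaining tokens stream at a constant `tokens_per_s` (decode).
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    # Every token after the first is paid for by the decode phase.
    decode_s = (output_tokens - 1) / tokens_per_s
    return ttft_s + decode_s


# Hypothetical numbers: 0.5 s to first token, 50 tokens/s, 101 output tokens.
total_s = estimate_latency(ttft_s=0.5, tokens_per_s=50.0, output_tokens=101)
print(f"estimated latency: {total_s:.2f} s")
```

Under this simple model, a faster prefill only shrinks the fixed `ttft_s` term, while a faster decode shrinks the term that grows with the length of the answer.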

Python 3.9 StatsProfile: My first OSS Contribution to CPython

You can try out all of the code in this article yourself using this Google Colaboratory notebook. If you’ve ever tried to debug and optimize your Python application, it’s likely that you stumbled upon the Python profilers to understand where most of the execution time is being spent. You enable the profiler at the beginning of a code segment you’re interested in profiling with pr.enable(), and call pr.create_stats() at the end. ...
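The enable/create_stats pattern described above can be sketched as follows. This is a minimal illustration rather than the article's own code: `busy_work` and `profile_busy_work` are hypothetical helpers, and the sketch assumes Python 3.9 or later, where `pstats.Stats.get_stats_profile()` returns a `StatsProfile` object.

```python
import cProfile
import pstats


def busy_work(n: int) -> int:
    """Hypothetical workload: sum of squares below n."""
    return sum(i * i for i in range(n))


def profile_busy_work(n: int = 10_000):
    """Profile one call to busy_work and return a StatsProfile."""
    pr = cProfile.Profile()
    pr.enable()          # start collecting at the top of the segment
    busy_work(n)
    pr.create_stats()    # stop collecting and record the results
    # get_stats_profile() (Python 3.9+) exposes the data as objects
    # instead of printed text.
    return pstats.Stats(pr).get_stats_profile()


stats_profile = profile_busy_work()
print(f"total time: {stats_profile.total_tt:.6f} s")
for name, fp in stats_profile.func_profiles.items():
    # Each FunctionProfile carries ncalls, tottime, cumtime, and more.
    print(f"{name}: ncalls={fp.ncalls} cumtime={fp.cumtime:.6f}")
```

Because `func_profiles` is a plain dictionary keyed by function name, the results can be filtered or sorted in code rather than parsed out of `print_stats()` output.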