Notes

Jul. 23, 2023

Flash Attention

Much of the progress in AI over the past few years has been fueled by the transformer architecture. Transformers are the closest thing we have right now to machine-learnable programs. They can be trained to generate images, text, videos, audio, video games, or even raw byte sequences. You name it.

Behind the transformer, powering many of these applications, are two key operations that make up roughly 99% of the FLOPs: attention and feed-forward layers. Both are conceptually very simple, yet very compute-intensive. In this post, my goal is to expand on the attention operation and how to implement it efficiently using an algorithm called Flash Attention.