pascal-attn

A memory-efficient tiled attention implementation for NVIDIA Pascal GPUs — because the math doesn't require tensor cores, and someone had to actually do it.


Available on PyPI and GitHub.

pip install pascal-attn

Why this exists

Nothing in Flash Attention's math makes newer hardware a hard requirement. The broader AI ecosystem just moved on and left Pascal below the bar for any kind of serious home AI work. Tensor cores, bf16 support, newer memory architectures: libraries started assuming all of it, and Pascal has none of it. What Pascal is actually good at is fp32.

The thing is, the online-softmax tiling that makes Flash Attention efficient doesn't require any of that. It's arithmetic. There's no architectural reason Pascal can't do it; nobody bothered to build it for Pascal, that's all. That annoyed me. I have Pascal hardware. I know people with a lot of it. Why should we all be locked out of memory-efficient attention because the ecosystem drew a line it didn't need to draw?

So I built it, and started using it to train my own model, which was already a memory-hungry beast.

What it actually does

Standard attention blows up memory fast — you materialise the full attention matrix, which scales quadratically with sequence length. The tiled approach processes queries and keys in blocks, keeping running statistics (max, denominator, output accumulator) in fp32 for numerical stability, and never materialising the full matrix. Memory stays roughly constant regardless of sequence length.
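To make the running-statistics idea concrete, here is a minimal NumPy sketch of online-softmax tiling. This is an illustration of the algorithm, not the package's actual kernel: the library name, dtypes, and tile loop here are simplified, but the recurrence (running max, running denominator, rescaled output accumulator, all fp32) is the standard one.

```python
import numpy as np

def tiled_attention(q, k, v, tile=64):
    """Online-softmax attention over key/value tiles.

    q: (Sq, d), k/v: (Sk, d). Running statistics are kept in fp32;
    the full (Sq, Sk) score matrix is never materialised -- only one
    (Sq, tile) block exists at a time.
    """
    Sq, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(Sq, -np.inf, dtype=np.float32)   # running row max
    l = np.zeros(Sq, dtype=np.float32)           # running softmax denominator
    o = np.zeros((Sq, d), dtype=np.float32)      # output accumulator

    for s in range(0, k.shape[0], tile):
        kt = k[s:s + tile].astype(np.float32)
        vt = v[s:s + tile].astype(np.float32)
        scores = (q.astype(np.float32) @ kt.T) * scale   # (Sq, tile) block
        m_new = np.maximum(m, scores.max(axis=1))
        alpha = np.exp(m - m_new)              # rescale factor for old stats
        p = np.exp(scores - m_new[:, None])    # this tile's unnormalised probs
        l = l * alpha + p.sum(axis=1)
        o = o * alpha[:, None] + p @ vt
        m = m_new
    return (o / l[:, None]).astype(q.dtype)
```

The output is mathematically identical to naive softmax(QKᵀ/√d)V; only the order of operations changes.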

The other problem it solves is GQA (Grouped-Query Attention) expansion. In GQA, key/value tensors have fewer heads than queries. Normally you expand the full KV tensor upfront before computing attention — which is expensive. pascal-attn expands only the slice needed per tile, so the overhead is constant rather than proportional to sequence length.
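The per-tile trick is simple: each query head only ever needs one KV head's slice, so you index instead of repeating. A hypothetical sketch (the helper name `gqa_tile_kv` is mine, not the package's API):

```python
import numpy as np

def gqa_tile_kv(k, v, q_head, n_q_heads, tile_start, tile_len):
    """Fetch the K/V tile for one query head without expanding the
    full KV tensor.  k, v: (H_kv, S, d) with H_kv < n_q_heads."""
    group = n_q_heads // k.shape[0]   # query heads per KV head
    kv_head = q_head // group         # query head -> shared KV head
    return (k[kv_head, tile_start:tile_start + tile_len],
            v[kv_head, tile_start:tile_start + tile_len])
```

The slice is a view, so the per-tile cost is bounded by the tile size instead of scaling with the full sequence the way an upfront `repeat` of K and V does.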

Tile sizes are tuned so each tile’s working set fits in the GPU’s L2 cache, which is where the QKᵀ matmul spends most of its time. Pascal gets 64, Volta/Turing get 128, Ampere and newer get 256. recommended_tile_size() detects the GPU and returns the right value automatically.
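The dispatch reduces to a compute-capability lookup. A sketch of the mapping described above; the real recommended_tile_size() presumably queries the device itself (e.g. via torch.cuda.get_device_capability()), whereas this hypothetical version takes the capability tuple explicitly:

```python
def recommended_tile_size(compute_capability):
    """Map a CUDA compute capability (major, minor) to a tile size.

    Pascal (6.x) -> 64, Volta/Turing (7.x) -> 128,
    Ampere (8.x) and newer -> 256.  Illustrative only; the package's
    own function detects the GPU rather than taking an argument.
    """
    major, _ = compute_capability
    if major <= 6:
        return 64
    if major == 7:
        return 128
    return 256
```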

Peak memory for a 2048-token sequence with 32 query heads, 8 KV heads, in fp16:

Approach                 Peak memory
Naive                    ~620 MB
pascal-attn (tile=64)    ~5 MB

Across a 28-layer model, GQA expansion alone saves around 2.4 GB.
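The quadratic term is easy to check on paper. A back-of-envelope helper (hypothetical, not part of the package) for the score tensor alone; a naive implementation typically also holds the softmax probabilities, and often fp32 intermediates, which is how the total climbs toward the ~620 MB figure above:

```python
def naive_score_bytes(batch, heads, seq, bytes_per_el=2):
    """Bytes to materialise one (seq x seq) attention-score tensor
    per head, e.g. fp16 scores for the naive approach.  Probabilities
    and any fp32 upcast in softmax come on top of this."""
    return batch * heads * seq * seq * bytes_per_el

# 1 x 32 heads x 2048 x 2048 x 2 bytes = 256 MiB for scores alone
```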

HuggingFace integration

Drop-in registration for Transformers-based models:

from pascal_attn.hf import register_with_transformers
register_with_transformers(tile_size=64, name="pascal_chunked")

Current state

It works. I’m using it. It’s on PyPI for whoever else is in the same situation — Pascal hardware, serious workloads, tired of being told to buy a newer GPU. Not claiming it solves every problem, but it solved mine.