I’m happy to announce my latest open-source project: Sharded Suite.
It’s a sharding + caching system that lets you load 70-billion-parameter language models on a single 24 GB GPU, without quantization or multi-GPU setups.
It works like this:
Page‑level sharding
The model file is sliced into tiny 4 KB “pages,” each with its own CRC checksum.
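Conceptually, the sharding step looks something like this (a minimal Python sketch of the idea, not the actual Sharded Suite code; the file name and helper are illustrative):

```python
import zlib

PAGE_SIZE = 4 * 1024  # 4 KB pages, as described above

def shard_model(path):
    """Slice a model file into 4 KB pages, each paired with its CRC32 checksum."""
    pages = []
    with open(path, "rb") as f:
        while chunk := f.read(PAGE_SIZE):
            pages.append((chunk, zlib.crc32(chunk)))
    return pages

# Hypothetical usage; "model.bin" stands in for any model weights file.
pages = shard_model("model.bin")
print(f"{len(pages)} pages")
```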
Atlas GPU cache
A CUDA LRU cache keeps only the hot pages in VRAM and streams the rest from disk.
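To give a feel for the caching policy, here’s a toy Python/PyTorch analogue (the real Atlas cache is CUDA-side; the class and names below are made up for illustration):

```python
from collections import OrderedDict
import torch

class ToyPageCache:
    """Toy LRU page cache: hot pages live in GPU memory, misses stream from disk."""

    def __init__(self, capacity_pages, page_store):
        self.capacity = capacity_pages
        self.store = page_store      # hypothetical: maps page_id -> raw page bytes
        self.vram = OrderedDict()    # page_id -> GPU tensor, kept in LRU order

    def get(self, page_id):
        if page_id in self.vram:
            self.vram.move_to_end(page_id)   # cache hit: mark most recently used
            return self.vram[page_id]
        # Cache miss: stream the page to the GPU, evicting the coldest one if full.
        page = torch.frombuffer(bytearray(self.store[page_id]), dtype=torch.uint8).cuda()
        if len(self.vram) >= self.capacity:
            self.vram.popitem(last=False)    # drop the least recently used page
        self.vram[page_id] = page
        return page
```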
Zero‑copy, hardware CRC
Integrity is verified on‑GPU at ~0.03 ms per page, so there’s practically no overhead.
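Here’s the verify-before-use step in CPU stand-in form (the real path computes the CRC on-GPU; zlib.crc32 here is just to show where the check sits):

```python
import zlib

def load_verified(page_bytes, expected_crc):
    """Recompute a page's checksum and reject it if it doesn't match."""
    actual = zlib.crc32(page_bytes)
    if actual != expected_crc:
        raise IOError(f"corrupt page: expected {expected_crc:#010x}, got {actual:#010x}")
    return page_bytes
```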
Together, these save 7-10× VRAM while holding 85-95% of the original throughput. For scale: 70 B parameters at FP16 is ~140 GB of weights, so a 7-10× cut brings the resident working set to roughly 14-20 GB, which fits in a 24 GB card.
It’s on my GitHub: https://github.com/idkcallme/Sharded-Suite.git
I have a few ideas for what to do with it, but I’d love suggestions on what to improve or other things I could use it for.
So far, I can use it for:
Running LLaMA-style models locally for research/tinkering, and cutting cloud GPU costs by swapping gigantic multi-GPU instances for a single-GPU box.